Team Six Members:
- Ananta Arora (SID: 100421624)
- Jinghao Chen (SID: 100406201)
- Roxanne Alvarez (SID: 100405742)
- Teshani Jayasinghe (SID: 100422405)
Summary¶
This is a general summary.
- Assess the general characteristics of the dataset
- How many records do we have? How many variables?
- What are the variable names? Are they meaningful?
- What type is each variable
- How many unique values does each variable have?
- What value occurs most frequently, and how often does it occur?
- Are there missing observations (vertically and horizontally)? If so, how frequently does this occur?
  - Number of missing values by column
  - Number of missing values by row
  - Decision on dropping missing values
2. Examine descriptive statistics for each variable
For categorical variables, answer the main questions:
- [How many distinct values or “levels” does the variable exhibit?](#dist_cat)
- [How often does each of these levels occur in the dataset?](#cat_level)
- [How does the behavior of another variable X vary over the levels of C?](#behavior)

For numerical variables, answer the main questions:
- [What is the mean, median, standard deviation?](#numsummary)
- [Does the data follow the normal distribution?](#normality)
  - [Shapiro-Wilk Test](#shapiro)
- Where possible, and certainly for any variable of particular interest, examine exploratory visualizations and identify anomalies
- Look at the relations between key variables using the ideas of visualization and statistical tests
  - Log Transformation Method
  - L2 Normalization
  - BoxCox method
  - Min-Max Method
- Statistical Tests
  - [Continuous Variables](#contvar)
  - [Ordinal Variables](#ordinal)
  - [Binary Variables](#binary)
  - [Summary Table](#summary)
Packages¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
from scipy.stats import ttest_ind
from scipy.stats import boxcox
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
# prompt: mount google drive
# from google.colab import drive
# drive.mount('/content/drive')
Setting¶
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
Data Set¶
Load dataset¶
df = pd.read_csv("filtered_data.csv")
df_schema = pd.read_csv('schema.csv')
Data Exploration¶
# number of records and variables
df.shape
(18883, 70)
Schema¶
# variable names
df.columns
Index(['Hospital Mortality', 'Age', 'Gender', 'Uncomplicated Hypertension',
'Complicated Hypertension', 'Uncomplicated Diabetes',
'Complicated Diabetes', 'Malignancy', 'Hematologic Disease',
'Metastasis', 'Peripheral Vascular Disease', 'Hypothyroidism',
'Chronic Heart Failure', 'Stroke', 'Liver Disease', 'SAPS II', 'SOFA',
'OASIS', 'Sepsis', 'Any Organ Failure', 'Severe Respiratory Failure',
'Severe Coagulation Failure', 'Severe Liver Failure',
'Severe Cardiovascular Failure',
'Severe Central Nervous System Failure', 'Severe Renal Failure',
'Respiratory Dysfunction', 'Cardiovascular Dysfunction',
'Renal Dysfunction', 'Hematologic Dysfunction', 'Metabolic Dysfunction',
'Neurologic Dysfunction', 'Max Heart Rate', 'Min Heart Rate',
'Mean Heart Rate', 'Max MAP', 'Min MAP', 'Mean MAP',
'Max Systolic Pressure', 'Min Systolic Pressure',
'Mean Systolic Pressure', 'Max Diastolic Pressure',
'Min Diastolic Pressure', 'Mean Diastolic Pressure', 'Max Temperature',
'Min Temperature', 'Mean Temperature', 'Max Lactate', 'Min Lactate',
'Mean Lactate', 'Max pH', 'Min pH', 'Mean pH', 'Max Glucose',
'Min Glucose', 'Mean Glucose', 'Max WBC', 'Min WBC', 'Mean WBC',
'Max BUN', 'Min BUN', 'Mean BUN', 'Max Creatinine', 'Min Creatinine',
'Mean Creatinine', 'Max Hemoglobin', 'Min Hemoglobin',
'Mean Hemoglobin', 'Ventilation Duration (h)', 'RRT'],
dtype='object')
#Check the data types if correct
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18883 entries, 0 to 18882 Data columns (total 70 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Hospital Mortality 18883 non-null int64 1 Age 18883 non-null int64 2 Gender 18883 non-null object 3 Uncomplicated Hypertension 18883 non-null int64 4 Complicated Hypertension 18883 non-null int64 5 Uncomplicated Diabetes 18883 non-null int64 6 Complicated Diabetes 18883 non-null int64 7 Malignancy 18883 non-null int64 8 Hematologic Disease 18883 non-null int64 9 Metastasis 18883 non-null int64 10 Peripheral Vascular Disease 18883 non-null int64 11 Hypothyroidism 18883 non-null int64 12 Chronic Heart Failure 18883 non-null int64 13 Stroke 18883 non-null int64 14 Liver Disease 18883 non-null int64 15 SAPS II 18883 non-null int64 16 SOFA 18883 non-null int64 17 OASIS 18883 non-null int64 18 Sepsis 18883 non-null int64 19 Any Organ Failure 18883 non-null int64 20 Severe Respiratory Failure 18883 non-null int64 21 Severe Coagulation Failure 18883 non-null int64 22 Severe Liver Failure 18883 non-null int64 23 Severe Cardiovascular Failure 18883 non-null int64 24 Severe Central Nervous System Failure 18883 non-null int64 25 Severe Renal Failure 18883 non-null int64 26 Respiratory Dysfunction 18883 non-null int64 27 Cardiovascular Dysfunction 18883 non-null int64 28 Renal Dysfunction 18883 non-null int64 29 Hematologic Dysfunction 18883 non-null int64 30 Metabolic Dysfunction 18883 non-null int64 31 Neurologic Dysfunction 18883 non-null int64 32 Max Heart Rate 18842 non-null float64 33 Min Heart Rate 18842 non-null float64 34 Mean Heart Rate 18842 non-null float64 35 Max MAP 18841 non-null float64 36 Min MAP 18841 non-null float64 37 Mean MAP 18841 non-null float64 38 Max Systolic Pressure 18823 non-null float64 39 Min Systolic Pressure 18823 non-null float64 40 Mean Systolic Pressure 18823 non-null float64 41 Max Diastolic Pressure 18822 non-null float64 42 Min Diastolic Pressure 18822 non-null float64 
43 Mean Diastolic Pressure 18822 non-null float64 44 Max Temperature 18196 non-null float64 45 Min Temperature 18196 non-null float64 46 Mean Temperature 18196 non-null float64 47 Max Lactate 13782 non-null float64 48 Min Lactate 13782 non-null float64 49 Mean Lactate 13782 non-null float64 50 Max pH 17710 non-null float64 51 Min pH 17710 non-null float64 52 Mean pH 17710 non-null float64 53 Max Glucose 18808 non-null float64 54 Min Glucose 18808 non-null float64 55 Mean Glucose 18808 non-null float64 56 Max WBC 18654 non-null float64 57 Min WBC 18654 non-null float64 58 Mean WBC 18654 non-null float64 59 Max BUN 18788 non-null float64 60 Min BUN 18788 non-null float64 61 Mean BUN 18788 non-null float64 62 Max Creatinine 18788 non-null float64 63 Min Creatinine 18788 non-null float64 64 Mean Creatinine 18788 non-null float64 65 Max Hemoglobin 18792 non-null float64 66 Min Hemoglobin 18792 non-null float64 67 Mean Hemoglobin 18792 non-null float64 68 Ventilation Duration (h) 18386 non-null float64 69 RRT 18883 non-null int64 dtypes: float64(37), int64(32), object(1) memory usage: 10.1+ MB
df_schema.set_index('variable_name', inplace=True)
df_schema
| variable_name | category | variable_type |
|---|---|---|
| Hospital Mortality | Target | binary |
| Age | Demographic | continuous |
| Gender | Demographic | binary |
| Uncomplicated Hypertension | Medical history | binary |
| Complicated Hypertension | Medical history | binary |
| Uncomplicated Diabetes | Medical history | binary |
| Complicated Diabetes | Medical history | binary |
| Malignancy | Medical history | binary |
| Hematologic Disease | Medical history | binary |
| Metastasis | Medical history | binary |
| Peripheral Vascular Disease | Medical history | binary |
| Hypothyroidism | Medical history | binary |
| Chronic Heart Failure | Medical history | binary |
| Stroke | Medical history | binary |
| Liver Disease | Medical history | binary |
| SAPS II | Disease severity | ordinal |
| SOFA | Disease severity | ordinal |
| OASIS | Disease severity | ordinal |
| Sepsis | Diagnosis | binary |
| Any Organ Failure | Diagnosis | binary |
| Severe Respiratory Failure | Diagnosis | binary |
| Severe Coagulation Failure | Diagnosis | binary |
| Severe Liver Failure | Diagnosis | binary |
| Severe Cardiovascular Failure | Diagnosis | binary |
| Severe Central Nervous System Failure | Diagnosis | binary |
| Severe Renal Failure | Diagnosis | binary |
| Respiratory Dysfunction | Diagnosis | binary |
| Cardiovascular Dysfunction | Diagnosis | binary |
| Renal Dysfunction | Diagnosis | binary |
| Hematologic Dysfunction | Diagnosis | binary |
| Metabolic Dysfunction | Diagnosis | binary |
| Neurologic Dysfunction | Diagnosis | binary |
| Max Heart Rate | Vital signs | continuous |
| Min Heart Rate | Vital signs | continuous |
| Mean Heart Rate | Vital signs | continuous |
| Max MAP | Vital signs | continuous |
| Min MAP | Vital signs | continuous |
| Mean MAP | Vital signs | continuous |
| Max Systolic Pressure | Vital signs | continuous |
| Min Systolic Pressure | Vital signs | continuous |
| Mean Systolic Pressure | Vital signs | continuous |
| Max Diastolic Pressure | Vital signs | continuous |
| Min Diastolic Pressure | Vital signs | continuous |
| Mean Diastolic Pressure | Vital signs | continuous |
| Max Temperature | Vital signs | continuous |
| Min Temperature | Vital signs | continuous |
| Mean Temperature | Vital signs | continuous |
| Max Lactate | Laboratory results | continuous |
| Min Lactate | Laboratory results | continuous |
| Mean Lactate | Laboratory results | continuous |
| Max pH | Laboratory results | continuous |
| Min pH | Laboratory results | continuous |
| Mean pH | Laboratory results | continuous |
| Max Glucose | Laboratory results | continuous |
| Min Glucose | Laboratory results | continuous |
| Mean Glucose | Laboratory results | continuous |
| Max WBC | Laboratory results | continuous |
| Min WBC | Laboratory results | continuous |
| Mean WBC | Laboratory results | continuous |
| Max BUN | Laboratory results | continuous |
| Min BUN | Laboratory results | continuous |
| Mean BUN | Laboratory results | continuous |
| Max Creatinine | Laboratory results | continuous |
| Min Creatinine | Laboratory results | continuous |
| Mean Creatinine | Laboratory results | continuous |
| Max Hemoglobin | Laboratory results | continuous |
| Min Hemoglobin | Laboratory results | continuous |
| Mean Hemoglobin | Laboratory results | continuous |
| Ventilation Duration (h) | Treatment | continuous |
| RRT | Treatment | binary |
Schema Visualizations¶
# Calculate the percentages
df_percentages = df_schema['category'].value_counts(normalize=True) * 100
# Plot the percentages
df_percentages.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.title('Percentage of Variables in Each Category _ 67 Features')
plt.gca().spines[['top', 'right']].set_visible(False)
# Calculate the percentages
df_percentages = df_schema['variable_type'].value_counts(normalize=True) * 100
# Plot the percentages
df_percentages.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.title('Percentage of Variables in Each Type _ 67 Features')
plt.gca().spines[['top', 'right']].set_visible(False)
Unique Values¶
# number of unique values per variable
df.nunique()
Hospital Mortality 2 Age 72 Gender 2 Uncomplicated Hypertension 2 Complicated Hypertension 2 Uncomplicated Diabetes 2 Complicated Diabetes 2 Malignancy 2 Hematologic Disease 2 Metastasis 2 Peripheral Vascular Disease 2 Hypothyroidism 2 Chronic Heart Failure 2 Stroke 2 Liver Disease 2 SAPS II 107 SOFA 23 OASIS 64 Sepsis 2 Any Organ Failure 2 Severe Respiratory Failure 2 Severe Coagulation Failure 2 Severe Liver Failure 2 Severe Cardiovascular Failure 2 Severe Central Nervous System Failure 2 Severe Renal Failure 2 Respiratory Dysfunction 2 Cardiovascular Dysfunction 2 Renal Dysfunction 2 Hematologic Dysfunction 2 Metabolic Dysfunction 2 Neurologic Dysfunction 2 Max Heart Rate 162 Min Heart Rate 132 Mean Heart Rate 12730 Max MAP 426 Min MAP 258 Mean MAP 14561 Max Systolic Pressure 205 Min Systolic Pressure 167 Mean Systolic Pressure 13300 Max Diastolic Pressure 179 Min Diastolic Pressure 92 Mean Diastolic Pressure 11778 Max Temperature 334 Min Temperature 369 Mean Temperature 11507 Max Lactate 239 Min Lactate 182 Mean Lactate 946 Max pH 89 Min pH 95 Mean pH 78 Max Glucose 573 Min Glucose 332 Mean Glucose 4984 Max WBC 553 Min WBC 445 Mean WBC 1974 Max BUN 167 Min BUN 151 Mean BUN 1077 Max Creatinine 144 Min Creatinine 123 Mean Creatinine 606 Max Hemoglobin 143 Min Hemoglobin 159 Mean Hemoglobin 934 Ventilation Duration (h) 5895 RRT 2 dtype: int64
Most frequently occurring value per variable
# most frequently occurring value and its count
most_frequent_values = {}
for column in df.columns:
    most_common = df[column].value_counts().idxmax()
    count = df[column].value_counts().max()
    most_frequent_values[column] = {'value': most_common, 'count': count}
# DataFrame from the dictionary
result_df = pd.DataFrame(most_frequent_values).T
print(result_df)
value count Hospital Mortality 0 15866 Age 77 504 Gender M 11457 Uncomplicated Hypertension 0 10123 Complicated Hypertension 0 17418 Uncomplicated Diabetes 0 14980 Complicated Diabetes 0 17924 Malignancy 0 16871 Hematologic Disease 0 16070 Metastasis 0 18055 Peripheral Vascular Disease 0 17244 Hypothyroidism 0 17307 Chronic Heart Failure 0 14348 Stroke 0 17825 Liver Disease 0 17085 SAPS II 34 692 SOFA 4 2872 OASIS 35 972 Sepsis 0 16063 Any Organ Failure 1 9574 Severe Respiratory Failure 0 17663 Severe Coagulation Failure 0 18783 Severe Liver Failure 0 18663 Severe Cardiovascular Failure 0 16581 Severe Central Nervous System Failure 0 17784 Severe Renal Failure 0 17965 Respiratory Dysfunction 0 14003 Cardiovascular Dysfunction 0 16329 Renal Dysfunction 0 14244 Hematologic Dysfunction 0 16858 Metabolic Dysfunction 0 17010 Neurologic Dysfunction 0 17190 Max Heart Rate 88.0 534.0 Min Heart Rate 70.0 647.0 Mean Heart Rate 87.0 26.0 Max MAP 93.0 484.0 Min MAP 58.0 773.0 Mean MAP 74.0 26.0 Max Systolic Pressure 150.0 396.0 Min Systolic Pressure 85.0 601.0 Mean Systolic Pressure 108.0 26.0 Max Diastolic Pressure 80.0 572.0 Min Diastolic Pressure 45.0 863.0 Mean Diastolic Pressure 60.0 37.0 Max Temperature 37.5 645.0 Min Temperature 36.111111 607.0 Mean Temperature 36.944444 51.0 Max Lactate 2.0 461.0 Min Lactate 1.0 976.0 Mean Lactate 1.4 366.0 Max pH 7.44 1224.0 Min pH 7.32 963.0 Mean pH 7.38 1421.0 Max Glucose 165.0 177.0 Min Glucose 99.0 343.0 Mean Glucose 130.0 76.0 Max WBC 12.3 162.0 Min WBC 10.2 193.0 Mean WBC 9.7 94.0 Max BUN 15.0 1041.0 Min BUN 13.0 1128.0 Mean BUN 14.0 544.0 Max Creatinine 0.8 2237.0 Min Creatinine 0.7 2678.0 Mean Creatinine 0.8 1211.0 Max Hemoglobin 12.7 403.0 Min Hemoglobin 9.4 369.0 Mean Hemoglobin 9.7 138.0 Ventilation Duration (h) 4.0 354.0 RRT 0 18328
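The loop above calls `value_counts()` twice per column. A minimal sketch of the same summary with one call per column is shown below; the helper name `most_frequent` and the toy frame are made up for illustration and are not part of the notebook.

```python
import pandas as pd

def most_frequent(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the modal value and its count for every column."""
    rows = {}
    for column in frame.columns:
        # value_counts() sorts by frequency (descending) and skips NaN by default
        counts = frame[column].value_counts()
        rows[column] = {"value": counts.index[0], "count": int(counts.iloc[0])}
    return pd.DataFrame(rows).T

# Toy data (hypothetical), standing in for the real dataset
toy = pd.DataFrame({
    "Gender": ["M", "M", "F", "M"],
    "Sepsis": [0, 0, 0, 1],
})
summary = most_frequent(toy)
print(summary)
```

On the real data this would be called as `most_frequent(df)`.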
Data Set Modification¶
# missing values column-wise
na_count = df.isnull().sum() # total count
na_pct = (na_count/len(df))*100 # percentage
na_df = pd.DataFrame({'Count': na_count.values,
'Percentage (%)': na_pct}).reset_index().rename(columns = {'index': 'Feature'})
na_df.sort_values(by='Percentage (%)', ascending=False)
| | Feature | Count | Percentage (%) |
|---|---|---|---|
| 49 | Mean Lactate | 5101 | 27.013716 |
| 48 | Min Lactate | 5101 | 27.013716 |
| 47 | Max Lactate | 5101 | 27.013716 |
| 50 | Max pH | 1173 | 6.211937 |
| 51 | Min pH | 1173 | 6.211937 |
| 52 | Mean pH | 1173 | 6.211937 |
| 44 | Max Temperature | 687 | 3.638193 |
| 45 | Min Temperature | 687 | 3.638193 |
| 46 | Mean Temperature | 687 | 3.638193 |
| 68 | Ventilation Duration (h) | 497 | 2.631997 |
| 58 | Mean WBC | 229 | 1.212731 |
| 57 | Min WBC | 229 | 1.212731 |
| 56 | Max WBC | 229 | 1.212731 |
| 64 | Mean Creatinine | 95 | 0.503098 |
| 63 | Min Creatinine | 95 | 0.503098 |
| 62 | Max Creatinine | 95 | 0.503098 |
| 61 | Mean BUN | 95 | 0.503098 |
| 60 | Min BUN | 95 | 0.503098 |
| 59 | Max BUN | 95 | 0.503098 |
| 65 | Max Hemoglobin | 91 | 0.481915 |
| 66 | Min Hemoglobin | 91 | 0.481915 |
| 67 | Mean Hemoglobin | 91 | 0.481915 |
| 55 | Mean Glucose | 75 | 0.397183 |
| 53 | Max Glucose | 75 | 0.397183 |
| 54 | Min Glucose | 75 | 0.397183 |
| 43 | Mean Diastolic Pressure | 61 | 0.323042 |
| 42 | Min Diastolic Pressure | 61 | 0.323042 |
| 41 | Max Diastolic Pressure | 61 | 0.323042 |
| 38 | Max Systolic Pressure | 60 | 0.317746 |
| 39 | Min Systolic Pressure | 60 | 0.317746 |
| 40 | Mean Systolic Pressure | 60 | 0.317746 |
| 35 | Max MAP | 42 | 0.222422 |
| 37 | Mean MAP | 42 | 0.222422 |
| 36 | Min MAP | 42 | 0.222422 |
| 32 | Max Heart Rate | 41 | 0.217127 |
| 33 | Min Heart Rate | 41 | 0.217127 |
| 34 | Mean Heart Rate | 41 | 0.217127 |
| 0 | Hospital Mortality | 0 | 0.000000 |
| 1 | Age | 0 | 0.000000 |
| 31 | Neurologic Dysfunction | 0 | 0.000000 |
| 2 | Gender | 0 | 0.000000 |
| 3 | Uncomplicated Hypertension | 0 | 0.000000 |
| 4 | Complicated Hypertension | 0 | 0.000000 |
| 5 | Uncomplicated Diabetes | 0 | 0.000000 |
| 6 | Complicated Diabetes | 0 | 0.000000 |
| 7 | Malignancy | 0 | 0.000000 |
| 8 | Hematologic Disease | 0 | 0.000000 |
| 9 | Metastasis | 0 | 0.000000 |
| 10 | Peripheral Vascular Disease | 0 | 0.000000 |
| 11 | Hypothyroidism | 0 | 0.000000 |
| 12 | Chronic Heart Failure | 0 | 0.000000 |
| 13 | Stroke | 0 | 0.000000 |
| 14 | Liver Disease | 0 | 0.000000 |
| 15 | SAPS II | 0 | 0.000000 |
| 16 | SOFA | 0 | 0.000000 |
| 17 | OASIS | 0 | 0.000000 |
| 18 | Sepsis | 0 | 0.000000 |
| 19 | Any Organ Failure | 0 | 0.000000 |
| 20 | Severe Respiratory Failure | 0 | 0.000000 |
| 21 | Severe Coagulation Failure | 0 | 0.000000 |
| 22 | Severe Liver Failure | 0 | 0.000000 |
| 23 | Severe Cardiovascular Failure | 0 | 0.000000 |
| 24 | Severe Central Nervous System Failure | 0 | 0.000000 |
| 25 | Severe Renal Failure | 0 | 0.000000 |
| 26 | Respiratory Dysfunction | 0 | 0.000000 |
| 27 | Cardiovascular Dysfunction | 0 | 0.000000 |
| 28 | Renal Dysfunction | 0 | 0.000000 |
| 29 | Hematologic Dysfunction | 0 | 0.000000 |
| 30 | Metabolic Dysfunction | 0 | 0.000000 |
| 69 | RRT | 0 | 0.000000 |
temp_df = df.copy()
# Calculate missing values by row
missing_values_by_row = df.isnull().sum(axis=1)
# Add the missing values count to the original DataFrame
temp_df["MissingValuesCount"] = missing_values_by_row
total_rows = len(temp_df)
temp_df["MissingValuesPercentage"] = (temp_df["MissingValuesCount"] / total_rows) * 100
# Sort the DataFrame by the "MissingValuesCount" column in descending order
df_sorted = temp_df.sort_values(by="MissingValuesPercentage", ascending=False)
# Print the sorted DataFrame
print(df_sorted[['MissingValuesCount','MissingValuesPercentage']].head(100))
MissingValuesCount MissingValuesPercentage 6766 36 0.190648 11565 36 0.190648 960 36 0.190648 6480 36 0.190648 8888 27 0.142986 8750 25 0.132394 1794 24 0.127098 4749 24 0.127098 4900 24 0.127098 13876 24 0.127098 13534 24 0.127098 10804 24 0.127098 4171 24 0.127098 1682 22 0.116507 4219 22 0.116507 1509 22 0.116507 8097 22 0.116507 11525 22 0.116507 15573 22 0.116507 12752 21 0.111211 18151 21 0.111211 18715 21 0.111211 12726 21 0.111211 17788 21 0.111211 12586 21 0.111211 12849 21 0.111211 3212 21 0.111211 8123 21 0.111211 10182 21 0.111211 18534 21 0.111211 1462 21 0.111211 2913 21 0.111211 13524 21 0.111211 5816 21 0.111211 17881 21 0.111211 4744 21 0.111211 1447 21 0.111211 15116 21 0.111211 14456 21 0.111211 10515 21 0.111211 16846 21 0.111211 18680 21 0.111211 6245 21 0.111211 16475 21 0.111211 13882 21 0.111211 3341 21 0.111211 11789 21 0.111211 15382 21 0.111211 9566 21 0.111211 7869 21 0.111211 14267 21 0.111211 6116 21 0.111211 6595 21 0.111211 2351 21 0.111211 18018 21 0.111211 11422 21 0.111211 4541 21 0.111211 3911 21 0.111211 10088 21 0.111211 8692 19 0.100620 7663 19 0.100620 17549 18 0.095324 10339 18 0.095324 4369 18 0.095324 13242 18 0.095324 1589 18 0.095324 8800 18 0.095324 2248 18 0.095324 9356 18 0.095324 14094 18 0.095324 3458 18 0.095324 947 18 0.095324 14802 18 0.095324 16359 18 0.095324 18190 18 0.095324 9073 18 0.095324 924 16 0.084732 10246 16 0.084732 11980 15 0.079437 12658 15 0.079437 3926 15 0.079437 16373 15 0.079437 17372 15 0.079437 13374 15 0.079437 6557 15 0.079437 3017 15 0.079437 3224 15 0.079437 9248 15 0.079437 18260 15 0.079437 16385 15 0.079437 341 15 0.079437 18001 15 0.079437 16029 15 0.079437 11143 15 0.079437 16852 15 0.079437 15831 15 0.079437 9668 15 0.079437 4584 15 0.079437 7438 15 0.079437 687 15 0.079437
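Note that `MissingValuesPercentage` above divides each row's missing count by the number of rows (18883), which is why 36 missing fields shows up as 0.19%. If the intent is the share of a record's fields that are missing, the divisor would be the number of columns instead. A small sketch of that variant on a made-up toy frame:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), four records with four fields each
toy = pd.DataFrame({
    "Max Lactate": [2.0, np.nan, np.nan, 1.1],
    "Max pH": [7.4, np.nan, 7.3, 7.5],
    "Age": [70, 65, 80, 59],
    "Sepsis": [0, 1, 0, 0],
})

missing_count = toy.isnull().sum(axis=1)
# Divide by the number of columns (fields per record), not the number of rows
missing_share = missing_count / toy.shape[1] * 100

row_summary = pd.DataFrame({
    "MissingValuesCount": missing_count,
    "MissingValuesPercentage": missing_share,
}).sort_values("MissingValuesCount", ascending=False)
print(row_summary)
```

The row ordering is the same under either divisor, so the ranking shown above is unaffected; only the percentage scale changes.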
Visualize the missing values¶
# heatmap
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
before_clear_mv = df.shape
print(f'Before clean missing values, the dataset has {before_clear_mv[0]} rows and {before_clear_mv[1]} variables')
Before clean missing values, the dataset has 18883 rows and 70 variables
# drop the columns RRT and Ventilation Duration (h) in df
df_orig = df.copy()
df = df.drop(['RRT', 'Ventilation Duration (h)'], axis=1)
def calculate_missing_percentages(df):
    df_copy = df.copy()  # Create a copy of the DataFrame
    missing_percentages = df_copy.isna().sum(axis=1) / len(df.columns) * 100
    missing_percentages = missing_percentages.round(2)
    # Create a new column to store the percentage of missing values
    df_copy['missing_percentage'] = missing_percentages
    # Group the DataFrame by the 'missing_percentage' column and count the number of observations in each group
    counts = df_copy.groupby('missing_percentage').size().reset_index(name='count')
    # Sort the table by percentage from highest to lowest
    counts = counts.sort_values(by='missing_percentage', ascending=False)
    # Print the table
    print(counts)
dem_f = df_schema[df_schema['category'] == 'Demographic'].index
medical_f = df_schema[df_schema['category'] == 'Medical history'].index
severe_f = df_schema[df_schema['category'] == 'Disease severity'].index
diag_f = df_schema[df_schema['category'] == 'Diagnosis'].index
vital_f = df_schema[df_schema['category'] == 'Vital signs'].index
lab_f = df_schema[df_schema['category'] == 'Laboratory results'].index
dem_df = df[dem_f]
medical_df = df[medical_f]
severe_df = df[severe_f]
diag_df = df[diag_f]
vital_df = df[vital_f]
lab_df = df[lab_f]
print('Demographic')
print(calculate_missing_percentages(dem_df))
print('-----------------------------')
print()
print('Medical history')
print(calculate_missing_percentages(medical_df))
print('-----------------------------')
print()
print('Disease severity')
print(calculate_missing_percentages(severe_df))
print('-----------------------------')
print()
print('Diagnosis')
print(calculate_missing_percentages(diag_df))
print('-----------------------------')
print()
print('Vital signs')
print(calculate_missing_percentages(vital_df))
print('-----------------------------')
print()
print('Laboratory results')
print(calculate_missing_percentages(lab_df))
Demographic
   missing_percentage  count
0                 0.0  18883
None
-----------------------------

Medical history
   missing_percentage  count
0                 0.0  18883
None
-----------------------------

Disease severity
   missing_percentage  count
0                 0.0  18883
None
-----------------------------

Diagnosis
   missing_percentage  count
0                 0.0  18883
None
-----------------------------

Vital signs
   missing_percentage  count
4               100.0     41
3                60.0      1
2                40.0     18
1                20.0    647
0                 0.0  18176
None
-----------------------------

Laboratory results
   missing_percentage  count
7              100.00     50
6               85.71      9
5               71.43      6
4               57.14     15
3               42.86     28
2               28.57    968
1               14.29   4345
0                0.00  13462
None
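The `None` lines in the output above appear because `calculate_missing_percentages` prints its table and returns nothing, and the caller then prints that return value. One possible variant, sketched here under the hypothetical name `missing_percentage_table`, returns the table so the caller decides whether to print it. The toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_percentage_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Count rows by the percentage of their columns that are missing."""
    pct = (frame.isna().sum(axis=1) / frame.shape[1] * 100).round(2)
    return (
        pct.value_counts()
        .rename_axis("missing_percentage")
        .reset_index(name="count")
        .sort_values("missing_percentage", ascending=False, ignore_index=True)
    )

# Toy data (hypothetical): rows are 0%, 100%, 0% and 50% missing
toy = pd.DataFrame({
    "Max MAP": [90.0, np.nan, 85.0, np.nan],
    "Min MAP": [60.0, np.nan, 55.0, 50.0],
})
table = missing_percentage_table(toy)
print(table)
```

With this shape, `print(missing_percentage_table(vital_df))` would show the table once, without a trailing `None`.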
df.dropna(subset=['Max Heart Rate', 'Min Heart Rate', 'Mean Heart Rate',
'Max MAP', 'Min MAP', 'Mean MAP',
'Max Systolic Pressure', 'Min Systolic Pressure', 'Mean Systolic Pressure',
'Max Diastolic Pressure', 'Min Diastolic Pressure', 'Mean Diastolic Pressure',
'Max Temperature', 'Min Temperature', 'Mean Temperature'],
inplace=True)
df.shape
(18176, 68)
df.dropna(subset=['Max Lactate', 'Min Lactate', 'Mean Lactate',
'Max pH', 'Min pH', 'Mean pH',
'Max Glucose', 'Min Glucose', 'Mean Glucose',
'Max WBC', 'Min WBC', 'Mean WBC',
'Max BUN', 'Min BUN', 'Mean BUN',
'Max Creatinine', 'Min Creatinine', 'Mean Creatinine',
'Max Hemoglobin', 'Min Hemoglobin', 'Mean Hemoglobin'],
inplace=True)
after_clean_missing_values = df.shape
After cleaning up missing values¶
print(f'Before clean missing values, the dataset has {before_clear_mv[0]} rows and {before_clear_mv[1]} variables')
print(f'After clean missing values, the dataset has {after_clean_missing_values[0]} rows and {after_clean_missing_values[1]} variables')
Before clean missing values, the dataset has 18883 rows and 70 variables
After clean missing values, the dataset has 12799 rows and 68 variables
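The two `dropna(..., inplace=True)` calls above remove 707 rows for the vital signs (18883 to 18176) and a further 5377 rows for the laboratory results (18176 to 12799). A sketch of a non-inplace helper that reports the loss at each step is shown below; the name `drop_and_report` and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def drop_and_report(frame: pd.DataFrame, subset, label: str) -> pd.DataFrame:
    """Drop rows with NaN in `subset` and print how many rows were removed."""
    before = len(frame)
    cleaned = frame.dropna(subset=subset)  # returns a new frame, input is untouched
    print(f"{label}: dropped {before - len(cleaned)} of {before} rows")
    return cleaned

# Toy data (hypothetical)
toy = pd.DataFrame({
    "Mean MAP": [75.0, np.nan, 80.0, 70.0],
    "Mean Lactate": [1.4, 2.0, np.nan, 1.1],
})
after_vitals = drop_and_report(toy, ["Mean MAP"], "Vital signs")
after_labs = drop_and_report(after_vitals, ["Mean Lactate"], "Laboratory results")
```

Returning a new frame at each step keeps the intermediate sizes available for comparison, which makes it easier to see which column group drives most of the loss.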
Visualization after cleaning up missing values¶
# heatmap
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
Valid Range¶
Keep only the rows whose vital-sign and laboratory values all fall within a valid range.
df = df.loc[
((df['Max Heart Rate'] >= 0) & (df['Max Heart Rate'] <= 350))
& ((df['Min Heart Rate'] >= 0) & (df['Min Heart Rate'] <= 350))
& ((df['Mean Heart Rate'] >= 0) & (df['Mean Heart Rate'] <= 350))
& ((df['Max MAP'] >= 14) & (df['Max MAP'] <= 330))
& ((df['Min MAP'] >= 14) & (df['Min MAP'] <= 330))
& ((df['Mean MAP'] >= 14) & (df['Mean MAP'] <= 330))
& ((df['Min Systolic Pressure'] >= 0) & (df['Min Systolic Pressure'] <= 375))
& ((df['Max Systolic Pressure'] >= 0) & (df['Max Systolic Pressure'] <= 375))
& ((df['Mean Systolic Pressure'] >= 0) & (df['Mean Systolic Pressure'] <= 375))
& ((df['Min Diastolic Pressure'] >= 0) & (df['Min Diastolic Pressure'] <= 375))
& ((df['Max Diastolic Pressure'] >= 0) & (df['Max Diastolic Pressure'] <= 375))
& ((df['Mean Diastolic Pressure'] >= 0) & (df['Mean Diastolic Pressure'] <= 375))
& ((df['Min Temperature'] >= 26)& (df['Min Temperature'] <= 45))
& ((df['Max Temperature'] >= 26)& (df['Max Temperature'] <= 45))
& ((df['Mean Temperature'] >= 26)& (df['Mean Temperature'] <= 45))
& ((df['Min pH'] >= 0)& (df['Min pH'] <= 14))
& ((df['Max pH'] >= 0)& (df['Max pH'] <= 14))
& ((df['Mean pH'] >= 0)& (df['Mean pH'] <= 14))
& ((df['Min Lactate'] >= 0.4)& (df['Min Lactate'] <= 30))
& ((df['Max Lactate'] >= 0.4)& (df['Max Lactate'] <= 30))
& ((df['Mean Lactate'] >= 0.4)& (df['Mean Lactate'] <= 30))
& ((df['Min Glucose'] >= 33)& (df['Min Glucose'] <= 2000))
& ((df['Max Glucose'] >= 33)& (df['Max Glucose'] <= 2000))
& ((df['Mean Glucose'] >= 33)& (df['Mean Glucose'] <= 2000))
& ((df['Min WBC'] >= 0)& (df['Min WBC'] <= 1000))
& ((df['Max WBC'] >= 0)& (df['Max WBC'] <= 1000))
& ((df['Mean WBC'] >= 0)& (df['Mean WBC'] <= 1000))
& ((df['Min BUN'] >= 0)& (df['Min BUN'] <= 250))
& ((df['Max BUN'] >= 0)& (df['Max BUN'] <= 250))
& ((df['Mean BUN'] >= 0)& (df['Mean BUN'] <= 250))
& ((df['Min Creatinine'] >= 0.1)& (df['Min Creatinine'] <= 60))
& ((df['Max Creatinine'] >= 0.1)& (df['Max Creatinine'] <= 60))
& ((df['Mean Creatinine'] >= 0.1)& (df['Mean Creatinine'] <= 60))
& ((df['Min Hemoglobin'] >= 0) & (df['Min Hemoglobin'] <= 25))
& ((df['Max Hemoglobin'] >= 0) & (df['Max Hemoglobin'] <= 25))
& ((df['Mean Hemoglobin'] >= 0) & (df['Mean Hemoglobin'] <= 25))
]
print(df['Hospital Mortality'].value_counts())
Hospital Mortality
0    10331
1     2158
Name: count, dtype: int64
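The range filter above repeats the same pattern for every Min/Max/Mean column, which makes copy-paste slips easy. A more compact sketch keeps the bounds in a dictionary and applies `Series.between` (inclusive on both ends by default). The dictionary below lists only three of the measurement groups, with bounds copied from the filter above; the names `VALID_RANGES` and `within_valid_ranges` and the toy frame are hypothetical.

```python
import pandas as pd

# Bounds copied from the filter above for three of the measurement groups
VALID_RANGES = {
    "Heart Rate": (0, 350),
    "Temperature": (26, 45),
    "pH": (0, 14),
}

def within_valid_ranges(frame: pd.DataFrame, ranges: dict) -> pd.DataFrame:
    """Keep rows whose Min/Max/Mean columns all lie inside the given bounds."""
    mask = pd.Series(True, index=frame.index)
    for measure, (low, high) in ranges.items():
        for stat in ("Min", "Max", "Mean"):
            column = f"{stat} {measure}"
            if column in frame.columns:  # skip columns the frame does not have
                mask &= frame[column].between(low, high)
    return frame.loc[mask]

# Toy data (hypothetical): row 1 has an impossible maximum, row 2 a negative minimum
toy = pd.DataFrame({
    "Min Heart Rate": [60.0, 55.0, -5.0],
    "Max Heart Rate": [120.0, 400.0, 110.0],
    "Mean Heart Rate": [80.0, 90.0, 70.0],
    "Min pH": [7.2, 7.3, 7.1],
})
clean = within_valid_ranges(toy, VALID_RANGES)
print(clean)
```

Note that `between` evaluates to `False` for missing values, so this sketch would also drop rows with NaN in a checked column; in the notebook those rows have already been removed by the earlier `dropna` calls.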
# "/content" exists only on Google Colab; write next to the notebook instead
df.to_csv("data_new.csv")
Create subsets¶
cont_dem = df_schema[(df_schema['variable_type'] == 'continuous') & (df_schema['category'] == 'Demographic')].index
cont_vital = df_schema[(df_schema['variable_type'] == 'continuous') & (df_schema['category'] == 'Vital signs')].index
cont_lab = df_schema[
(df_schema['variable_type'] == 'continuous')
& (df_schema['category'] == 'Laboratory results')
& df_schema.index.str.contains('Lactate|Crea|Hemog')
].index
cont_lab2 = df_schema[
(df_schema['variable_type'] == 'continuous')
& (df_schema['category'] == 'Laboratory results')
& df_schema.index.str.contains('Glucose|WBC|BUN|pH')
].index
binary = df_schema[((df_schema['variable_type']=='binary') | (df_schema.index == 'Hospital Mortality')) & (df_schema.index != 'RRT')].index
ordinal = df_schema[((df_schema['variable_type'] == 'ordinal') | (df_schema.index == 'Hospital Mortality'))].index
cont_all = df_schema[((df_schema['variable_type'] == 'continuous') | (df_schema.index == 'Hospital Mortality')) & (df_schema['category'] != 'Treatment')].index
cont_dem_df = df[cont_dem]
cont_vital_df = df[cont_vital]
cont_lab_df = df[cont_lab]
cont_lab2_df = df[cont_lab2]
cont_all_df = df[cont_all]
binary_df = df[binary]
ordinal_df = df[ordinal]
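The subset definitions above all follow one pattern: filter the schema by `variable_type` and `category`, and optionally keep the target column. A simplified sketch of that pattern as a reusable helper is shown below. The function `select_columns`, its parameters, and the five-row toy schema are assumptions for illustration; the sketch does not cover the category exclusion used for `cont_all`.

```python
import pandas as pd

def select_columns(schema, variable_type=None, category=None, always=()):
    """Return schema index labels matching the filters, plus any `always` labels."""
    mask = pd.Series(True, index=schema.index)
    if variable_type is not None:
        mask &= schema["variable_type"] == variable_type
    if category is not None:
        mask &= schema["category"] == category
    mask |= schema.index.isin(list(always))  # e.g. always keep the target
    return schema.index[mask.to_numpy()]

# Toy schema (hypothetical subset of the rows listed in the schema table)
schema = pd.DataFrame(
    {
        "category": ["Target", "Demographic", "Disease severity",
                     "Laboratory results", "Treatment"],
        "variable_type": ["binary", "continuous", "ordinal",
                          "continuous", "binary"],
    },
    index=pd.Index(["Hospital Mortality", "Age", "SOFA", "Max Lactate", "RRT"],
                   name="variable_name"),
)

ordinal_cols = select_columns(schema, variable_type="ordinal",
                              always=["Hospital Mortality"])
lab_cols = select_columns(schema, variable_type="continuous",
                          category="Laboratory results")
print(list(ordinal_cols), list(lab_cols))
```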
Categorical Variables¶
cat_feats = ['Hospital Mortality', 'Gender', 'Uncomplicated Hypertension',
'Complicated Hypertension', 'Uncomplicated Diabetes',
'Complicated Diabetes', 'Malignancy', 'Hematologic Disease',
'Metastasis', 'Peripheral Vascular Disease', 'Hypothyroidism',
'Chronic Heart Failure', 'Stroke', 'Liver Disease',
'Sepsis', 'Any Organ Failure', 'Severe Respiratory Failure',
'Severe Coagulation Failure', 'Severe Liver Failure',
'Severe Cardiovascular Failure',
'Severe Central Nervous System Failure', 'Severe Renal Failure',
'Respiratory Dysfunction', 'Cardiovascular Dysfunction',
'Renal Dysfunction', 'Hematologic Dysfunction', 'Metabolic Dysfunction',
'Neurologic Dysfunction']
categorical_stats = df[cat_feats].apply(lambda x: x.nunique())
categorical_stats
Hospital Mortality                       2
Gender                                   2
Uncomplicated Hypertension               2
Complicated Hypertension                 2
Uncomplicated Diabetes                   2
Complicated Diabetes                     2
Malignancy                               2
Hematologic Disease                      2
Metastasis                               2
Peripheral Vascular Disease              2
Hypothyroidism                           2
Chronic Heart Failure                    2
Stroke                                   2
Liver Disease                            2
Sepsis                                   2
Any Organ Failure                        2
Severe Respiratory Failure               2
Severe Coagulation Failure               2
Severe Liver Failure                     2
Severe Cardiovascular Failure            2
Severe Central Nervous System Failure    2
Severe Renal Failure                     2
Respiratory Dysfunction                  2
Cardiovascular Dysfunction               2
Renal Dysfunction                        2
Hematologic Dysfunction                  2
Metabolic Dysfunction                    2
Neurologic Dysfunction                   2
dtype: int64
print(f'There are a total of {len(cat_feats)} categorical variables')
There are a total of 28 categorical variables
value_counts_all = df[cat_feats].apply(pd.Series.value_counts)
value_counts_all
| | Hospital Mortality | Gender | Uncomplicated Hypertension | Complicated Hypertension | Uncomplicated Diabetes | Complicated Diabetes | Malignancy | Hematologic Disease | Metastasis | Peripheral Vascular Disease | Hypothyroidism | Chronic Heart Failure | Stroke | Liver Disease | Sepsis | Any Organ Failure | Severe Respiratory Failure | Severe Coagulation Failure | Severe Liver Failure | Severe Cardiovascular Failure | Severe Central Nervous System Failure | Severe Renal Failure | Respiratory Dysfunction | Cardiovascular Dysfunction | Renal Dysfunction | Hematologic Dysfunction | Metabolic Dysfunction | Neurologic Dysfunction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10331.0 | NaN | 7001.0 | 11427.0 | 10013.0 | 11821.0 | 11021.0 | 10430.0 | 11876.0 | 11448.0 | 11428.0 | 9504.0 | 11795.0 | 11151.0 | 10228.0 | 5462.0 | 11493.0 | 12408.0 | 12325.0 | 10650.0 | 11753.0 | 11803.0 | 8811.0 | 10485.0 | 8963.0 | 10995.0 | 10963.0 | 11245.0 |
| 1 | 2158.0 | NaN | 5488.0 | 1062.0 | 2476.0 | 668.0 | 1468.0 | 2059.0 | 613.0 | 1041.0 | 1061.0 | 2985.0 | 694.0 | 1338.0 | 2261.0 | 7027.0 | 996.0 | 81.0 | 164.0 | 1839.0 | 736.0 | 686.0 | 3678.0 | 2004.0 | 3526.0 | 1494.0 | 1526.0 | 1244.0 |
| F | NaN | 4891.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| M | NaN | 7598.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
df_cat = df[cat_feats]
# Count occurrences of 0 or 1 in each column
counts_0 = df_cat.apply(lambda x: x.isin([0]).sum())
counts_1 = df_cat.apply(lambda x: x.isin([1]).sum())
# Calculate percentage
percentages_0 = (counts_0 / len(df)) * 100
percentages_1 = (counts_1 / len(df)) * 100
perc_table = pd.DataFrame({'%0': percentages_0, '%1': percentages_1})
print(perc_table)
| | %0 | %1 |
|---|---|---|
| Hospital Mortality | 82.720794 | 17.279206 |
| Gender | 0.000000 | 0.000000 |
| Uncomplicated Hypertension | 56.057330 | 43.942670 |
| Complicated Hypertension | 91.496517 | 8.503483 |
| Uncomplicated Diabetes | 80.174554 | 19.825446 |
| Complicated Diabetes | 94.651293 | 5.348707 |
| Malignancy | 88.245656 | 11.754344 |
| Hematologic Disease | 83.513492 | 16.486508 |
| Metastasis | 95.091681 | 4.908319 |
| Peripheral Vascular Disease | 91.664665 | 8.335335 |
| Hypothyroidism | 91.504524 | 8.495476 |
| Chronic Heart Failure | 76.098967 | 23.901033 |
| Stroke | 94.443110 | 5.556890 |
| Liver Disease | 89.286572 | 10.713428 |
| Sepsis | 81.896069 | 18.103931 |
| Any Organ Failure | 43.734486 | 56.265514 |
| Severe Respiratory Failure | 92.024982 | 7.975018 |
| Severe Coagulation Failure | 99.351429 | 0.648571 |
| Severe Liver Failure | 98.686844 | 1.313156 |
| Severe Cardiovascular Failure | 85.275042 | 14.724958 |
| Severe Central Nervous System Failure | 94.106814 | 5.893186 |
| Severe Renal Failure | 94.507166 | 5.492834 |
| Respiratory Dysfunction | 70.550084 | 29.449916 |
| Cardiovascular Dysfunction | 83.953879 | 16.046121 |
| Renal Dysfunction | 71.767155 | 28.232845 |
| Hematologic Dysfunction | 88.037473 | 11.962527 |
| Metabolic Dysfunction | 87.781247 | 12.218753 |
| Neurologic Dysfunction | 90.039235 | 9.960765 |
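Gender shows 0% in both columns because it is coded as F/M rather than 0/1, so `isin([0])` and `isin([1])` never match. A minimal sketch of a label-agnostic alternative follows, using `value_counts(normalize=True)` on a toy frame. The toy data and the long-format layout are assumptions for illustration; the real notebook would pass `df[cat_feats]`.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Hospital Mortality': [0, 0, 0, 1],
    'Gender': ['F', 'M', 'M', 'M'],
})

# Build one row per (variable, level) so mixed label types never share an index
rows = []
for col in toy.columns:
    pct = toy[col].value_counts(normalize=True) * 100
    for level, value in pct.items():
        rows.append({'variable': col, 'level': level, 'percent': value})

level_pct = pd.DataFrame(rows)
print(level_pct)
```

The long format keeps every level of every variable, whatever its coding, at the cost of a taller table.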
# Density plots of each continuous variable, split by the target (Hospital Mortality)
for column in cont_all_df.columns:
    # Skip the target itself; we do not plot it against itself
    if column == 'Hospital Mortality':
        continue
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.kdeplot(data=df[df['Hospital Mortality'] == 0][column], ax=ax, label='Alive', fill=True, color='g')
    sns.kdeplot(data=df[df['Hospital Mortality'] == 1][column], ax=ax, label='Dead', fill=True, color='r')
    sns.kdeplot(data=df[column], ax=ax, label='Overall Classes', fill=True, color='b')
    plt.title(f'Density Plot of {column} by Hospital Mortality')
    plt.xlabel(column)
    plt.ylabel('Density')
    plt.legend(title='Target')
    plt.show()
Numerical Variables¶
# Separate numerical and categorical columns
numeric_features = ['Max Heart Rate', 'Min Heart Rate',
'Mean Heart Rate', 'Max MAP', 'Min MAP', 'Mean MAP',
'Max Systolic Pressure', 'Min Systolic Pressure',
'Mean Systolic Pressure', 'Max Diastolic Pressure',
'Min Diastolic Pressure', 'Mean Diastolic Pressure', 'Max Temperature',
'Min Temperature', 'Mean Temperature', 'Max Lactate', 'Min Lactate',
'Mean Lactate', 'Max pH', 'Min pH', 'Mean pH', 'Max Glucose',
'Min Glucose', 'Mean Glucose', 'Max WBC', 'Min WBC', 'Mean WBC',
'Max BUN', 'Min BUN', 'Mean BUN', 'Max Creatinine', 'Min Creatinine',
'Mean Creatinine', 'Max Hemoglobin', 'Min Hemoglobin',
'Mean Hemoglobin', 'SAPS II', 'SOFA',
'OASIS']
print(len(numeric_features))
39
subset_df = df[numeric_features]
# Get descriptive statistics for the selected features
statistics = subset_df.describe()
mean = statistics.loc['mean']
median = statistics.loc['50%'] # Median is the 50th percentile
std_dev = statistics.loc['std']
# Create a DataFrame to display the statistics in a table format
statistics_table = pd.DataFrame({'Mean': mean, 'Median': median, 'Standard Deviation': std_dev})
print(statistics_table)
| | Mean | Median | Standard Deviation |
|---|---|---|---|
| Max Heart Rate | 107.984474 | 106.000000 | 20.573834 |
| Min Heart Rate | 71.973765 | 71.000000 | 15.531557 |
| Mean Heart Rate | 88.430129 | 86.862069 | 15.372059 |
| Max MAP | 109.473430 | 103.000000 | 30.144985 |
| Min MAP | 57.380068 | 58.000000 | 12.192861 |
| Mean MAP | 78.237488 | 76.725000 | 10.222848 |
| Max Systolic Pressure | 152.713268 | 149.000000 | 24.243929 |
| Min Systolic Pressure | 87.046799 | 87.000000 | 17.442935 |
| Mean Systolic Pressure | 117.109939 | 114.826087 | 15.521249 |
| Max Diastolic Pressure | 83.701257 | 81.000000 | 18.191668 |
| Min Diastolic Pressure | 43.468572 | 44.000000 | 10.873250 |
| Mean Diastolic Pressure | 60.474160 | 59.657895 | 9.743214 |
| Max Temperature | 37.749812 | 37.722222 | 0.844661 |
| Min Temperature | 36.030913 | 36.111111 | 0.969213 |
| Mean Temperature | 36.962726 | 36.966667 | 0.724690 |
| Max Lactate | 3.351364 | 2.500000 | 2.793236 |
| Min Lactate | 1.809492 | 1.400000 | 1.397299 |
| Mean Lactate | 2.525750 | 2.000000 | 1.890930 |
| Max pH | 7.431130 | 7.440000 | 0.072189 |
| Min pH | 7.289433 | 7.300000 | 0.107355 |
| Mean pH | 7.364707 | 7.370000 | 0.070655 |
| Max Glucose | 192.261766 | 173.000000 | 88.173353 |
| Min Glucose | 109.329090 | 103.000000 | 36.344911 |
| Mean Glucose | 147.195135 | 136.000000 | 47.835281 |
| Max WBC | 15.648373 | 14.200000 | 11.421410 |
| Min WBC | 10.962118 | 10.000000 | 8.138209 |
| Mean WBC | 13.227665 | 12.100000 | 9.482562 |
| Max BUN | 25.997998 | 19.000000 | 20.530685 |
| Min BUN | 21.464008 | 16.000000 | 17.646050 |
| Mean BUN | 23.677328 | 17.500000 | 18.896375 |
| Max Creatinine | 1.440171 | 1.000000 | 1.404564 |
| Min Creatinine | 1.164977 | 0.800000 | 1.142817 |
| Mean Creatinine | 1.297446 | 0.930000 | 1.258200 |
| Max Hemoglobin | 12.418833 | 12.400000 | 1.961007 |
| Min Hemoglobin | 9.727840 | 9.600000 | 2.192672 |
| Mean Hemoglobin | 10.961597 | 10.730000 | 1.851872 |
| SAPS II | 39.696933 | 38.000000 | 15.151403 |
| SOFA | 5.412923 | 5.000000 | 3.419823 |
| OASIS | 36.362399 | 36.000000 | 8.186966 |
Normality check for the continuous variables before normalization. Visual inspection shows that outliers skew the shape of most distributions.
# Histogram and QQ plot for each column
def plot_histogram_qqplot(data, column):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    # Histogram with median and mean reference lines
    sns.histplot(data[column], kde=True, ax=axes[0])
    median_val = data[column].median()
    mean_val = data[column].mean()
    axes[0].axvline(median_val, color='r', linestyle='dashed', linewidth=2, label=f'Median: {median_val:.2f}')
    axes[0].axvline(mean_val, color='g', linestyle='dashed', linewidth=2, label=f'Mean: {mean_val:.2f}')
    axes[0].set_title(f'Histogram - {column}')
    axes[0].legend()
    # QQ plot
    sm.qqplot(data[column], line='s', ax=axes[1])
    axes[1].set_title(f'QQ Plot - {column}')
    plt.show()

# Iterate through columns and create plots
for column in cont_all_df.columns:
    if column != 'Hospital Mortality':
        plot_histogram_qqplot(cont_all_df, column)
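The Shapiro-Wilk test was run on the dataset with missing values removed. Every variable is flagged as significantly different from normal (p < 0.05). With more than 5,000 observations the test detects even minor departures from normality, and SciPy warns that the p-value may be inaccurate at this sample size, so the W statistic and the plots above are more informative than the p-values alone.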
# Perform Shapiro-Wilk test on each column in cont_all_df
# columns[1:] skips the first column, Hospital Mortality
for column in cont_all_df.columns[1:]:
    stat, p = stats.shapiro(cont_all_df[column])
    print(f"Shapiro-Wilk test for {column}:")
    print(f" Statistic: {stat}")
    print(f" p-value: {p}")
    print("")
    if p < 0.05:
        print(f"The distribution of {column} is significantly different from normal.")
    else:
        print(f"The distribution of {column} is not significantly different from normal.")
    print("---------------------------------------------------------------------")
| Variable | Statistic | p-value | Result |
|---|---|---|---|
| Age | 0.9579778909683228 | 0.0 | significantly different from normal |
| Max Heart Rate | 0.9716181755065918 | 6.866362475191604e-44 | significantly different from normal |
| Min Heart Rate | 0.9904375672340393 | 4.8635451107712535e-28 | significantly different from normal |
| Mean Heart Rate | 0.9883667230606079 | 1.270662467504408e-30 | significantly different from normal |
| Max MAP | 0.7134469747543335 | 0.0 | significantly different from normal |
| Min MAP | 0.9794363975524902 | 8.243728163044217e-39 | significantly different from normal |
| Mean MAP | 0.9694177508354187 | 4.203895392974451e-45 | significantly different from normal |
| Max Systolic Pressure | 0.9592243432998657 | 0.0 | significantly different from normal |
| Min Systolic Pressure | 0.9820803999900818 | 9.711820751813725e-37 | significantly different from normal |
| Mean Systolic Pressure | 0.9709146022796631 | 2.6624670822171524e-44 | significantly different from normal |
| Max Diastolic Pressure | 0.8902511596679688 | 0.0 | significantly different from normal |
| Min Diastolic Pressure | 0.9921280741691589 | 1.3540282213339945e-25 | significantly different from normal |
| Mean Diastolic Pressure | 0.984017014503479 | 4.607433333966349e-35 | significantly different from normal |
| Max Temperature | 0.9718073606491089 | 8.828180325246348e-44 | significantly different from normal |
| Min Temperature | 0.9212244153022766 | 0.0 | significantly different from normal |
| Mean Temperature | 0.9613032341003418 | 0.0 | significantly different from normal |
| Max Lactate | 0.7228389382362366 | 0.0 | significantly different from normal |
| Min Lactate | 0.6343636512756348 | 0.0 | significantly different from normal |
| Mean Lactate | 0.7003828287124634 | 0.0 | significantly different from normal |
| Max pH | 0.9707695245742798 | 2.2420775429197073e-44 | significantly different from normal |
| Min pH | 0.9386768341064453 | 0.0 | significantly different from normal |
| Mean pH | 0.9484682083129883 | 0.0 | significantly different from normal |
| Max Glucose | 0.7231720685958862 | 0.0 | significantly different from normal |
| Min Glucose | 0.8644881248474121 | 0.0 | significantly different from normal |
| Mean Glucose | 0.8101949095726013 | 0.0 | significantly different from normal |
| Max WBC | 0.4640159606933594 | 0.0 | significantly different from normal |
| Min WBC | 0.5044788718223572 | 0.0 | significantly different from normal |
| Mean WBC | 0.46362489461898804 | 0.0 | significantly different from normal |
| Max BUN | 0.7306792736053467 | 0.0 | significantly different from normal |
| Min BUN | 0.7287629842758179 | 0.0 | significantly different from normal |
| Mean BUN | 0.7304162383079529 | 0.0 | significantly different from normal |
| Max Creatinine | 0.5539169311523438 | 0.0 | significantly different from normal |
| Min Creatinine | 0.5376086235046387 | 0.0 | significantly different from normal |
| Mean Creatinine | 0.5448676347732544 | 0.0 | significantly different from normal |
| Max Hemoglobin | 0.9970400333404541 | 4.320630784165934e-15 | significantly different from normal |
| Min Hemoglobin | 0.9944753050804138 | 1.8967766414239853e-21 | significantly different from normal |
| Mean Hemoglobin | 0.9793708920478821 | 7.3712713313648e-39 | significantly different from normal |
/usr/local/lib/python3.10/dist-packages/scipy/stats/_morestats.py:1882: UserWarning: p-value may not be accurate for N > 5000.
warnings.warn("p-value may not be accurate for N > 5000.")
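The warning above means the reported p-values should be read with caution at this sample size. One possible workaround, sketched here under stated assumptions, is to test a random subsample of at most 5,000 values and report skewness alongside W. The helper name `shapiro_subsample` and the synthetic lognormal data are hypothetical; this is not how the original notebook ran the test.

```python
import numpy as np
import pandas as pd
from scipy import stats

def shapiro_subsample(series, max_n=5000, seed=0):
    """Shapiro-Wilk on a random subsample of at most max_n non-missing values."""
    values = series.dropna()
    if len(values) > max_n:
        values = values.sample(n=max_n, random_state=seed)
    stat, p = stats.shapiro(values)
    # Skewness gives a sense of how far from symmetric the data are
    return {'n': len(values), 'W': stat, 'p': p, 'skew': stats.skew(values)}

# Synthetic right-skewed data standing in for a lab variable
rng = np.random.default_rng(0)
synthetic = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=12000))

result = shapiro_subsample(synthetic)
print(result)
```

Subsampling keeps the p-value within the range SciPy considers reliable, but the result then depends on the random seed, so W and skewness are worth reporting together.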
Visualization Exploration¶
Demographics¶
plt.figure(figsize=(10, 6))
sns.boxplot(data=cont_dem_df)
plt.title("Box Plot for Continuous - Demographics")
plt.show()
Vital Signs¶
plt.figure(figsize=(40, 24))
sns.boxplot(data=cont_vital_df)
plt.title("Box Plots for Continuous - Vital Signs")
plt.show()
Laboratory Results¶
plt.figure(figsize=(40, 24))
sns.boxplot(data=cont_lab_df)
plt.title("Box Plots for Continuous - Lab 1")
plt.show()
plt.figure(figsize=(40, 24))
sns.boxplot(data=cont_lab2_df)
plt.title("Box Plots for Continuous - Lab 2")
plt.show()
contVar = df.select_dtypes(include=['float64']).columns
Q1 = df[contVar.values].quantile(0.25)
Q3 = df[contVar.values].quantile(0.75)
IQR = Q3 - Q1
# Find outliers for each continuous variable
outliersCount = ((df[contVar.values] < (Q1 - 1.5 * IQR)) | (df[contVar.values] > (Q3 + 1.5 * IQR))).sum()
# Display the number of outliers for each variable
print("Number of outliers for each continuous variable:")
print(outliersCount)
Number of outliers for each continuous variable:
Max Heart Rate              205
Min Heart Rate              294
Mean Heart Rate             166
Max MAP                     724
Min MAP                     657
Mean MAP                    311
Max Systolic Pressure       273
Min Systolic Pressure       415
Mean Systolic Pressure      296
Max Diastolic Pressure      403
Min Diastolic Pressure      368
Mean Diastolic Pressure     253
Max Temperature             291
Min Temperature             390
Mean Temperature            279
Max Lactate                 992
Min Lactate                1005
Mean Lactate                919
Max pH                      258
Min pH                      477
Mean pH                     362
Max Glucose                 766
Min Glucose                 535
Mean Glucose                785
Max WBC                     456
Min WBC                     430
Mean WBC                    413
Max BUN                    1082
Min BUN                    1032
Mean BUN                   1132
Max Creatinine             1275
Min Creatinine             1315
Mean Creatinine            1361
Max Hemoglobin               92
Min Hemoglobin               76
Mean Hemoglobin             137
dtype: int64
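The counts above come from the 1.5 × IQR rule. A small worked sketch of the same rule on a toy series makes the fences explicit. The function name `iqr_outlier_mask` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values for illustration only: Q1 = 2.25, Q3 = 4.75, IQR = 2.5,
# so the fences are -1.5 and 8.5
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
mask = iqr_outlier_mask(toy)
print(toy[mask].tolist())  # [100.0]
```

Taking `mask.mean()` instead of `mask.sum()` would give the share of flagged rows, which is easier to compare across variables than raw counts.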
# Set a consistent figure size for all plots
fig_size = (8, 6)
# Plot every binary variable against the target, skipping the target itself
for column in binary_df.columns.drop('Hospital Mortality'):
    # Percentage of patients in each category of the variable, within each mortality group
    percentages = binary_df.groupby('Hospital Mortality')[column].value_counts(normalize=True) * 100
    # Draw on an explicit Axes; calling plt.figure() before DataFrame.plot() leaves an empty extra figure
    fig, ax = plt.subplots(figsize=fig_size)
    percentages.unstack().plot(kind='bar', ax=ax)
    # Add percentage labels on top of each bar
    for p in ax.patches:
        ax.annotate(f'{p.get_height():.2f}%', (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', xytext=(0, 10), textcoords='offset points')
    plt.title(f'Percentage of Patients in Each Category of {column} by Hospital Mortality')
    plt.xlabel('Hospital Mortality')
    plt.ylabel('Percentage')
    plt.xticks(ticks=[0, 1], labels=['Survived', 'Died'], rotation=0)  # Horizontal tick labels
    # Legend outside the plot area
    plt.legend(title=column, bbox_to_anchor=(1.05, 1), loc='upper left')
    # Add an extra tick so the top labels are not clipped
    plt.yticks(list(plt.yticks()[0]) + [plt.yticks()[0][-1] + 10])
    plt.show()
# correlation_matrix = df[contVar].corr()
# plt.figure(figsize=(22, 20))
# sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=.5)
# plt.title("Heatmap of Correlation Matrix")
# plt.show()
Correlation Matrix without max and min variables¶
# Exclude 'Hospital Mortality' and columns containing 'Min' or 'Max'
cols_to_exclude = ['Hospital Mortality'] + [col for col in cont_all_df.columns if 'Min' in col or 'Max' in col]
# correlation matrix
corr_matrix = cont_all_df.drop(cols_to_exclude, axis=1).corr()
# Round the correlation values to 2 decimal places
corr_matrix = corr_matrix.round(2)
# heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix for Continuous Variables')
plt.show()
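`DataFrame.corr()` defaults to Pearson correlation, which measures linear association. Because most of the variables here are skewed, a rank-based Spearman correlation may be a useful cross-check. The sketch below uses synthetic data to show how the two can differ; the toy frame is an assumption and does not reproduce the project's values.

```python
import numpy as np
import pandas as pd

# Synthetic example: y is a monotonic but nonlinear function of a skewed x
rng = np.random.default_rng(0)
x = rng.lognormal(size=500)
toy = pd.DataFrame({'x': x, 'y': x ** 3})

pearson = toy.corr(method='pearson').round(2)
spearman = toy.corr(method='spearman').round(2)

print("Pearson: ", pearson.loc['x', 'y'])
print("Spearman:", spearman.loc['x', 'y'])
```

In the notebook, passing `method='spearman'` to the `corr()` call would be the only change needed to draw the same heatmap on ranks.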
for variable in contVar:
    # Create a fresh figure per variable so every plot gets the same size
    plt.figure(figsize=(14, 8))
    sns.pointplot(x='Hospital Mortality', y=variable, data=df, errorbar='sd', dodge=True)
    plt.title(f'Point Plot of {variable}')
    plt.xlabel('Hospital Mortality')
    plt.ylabel(f'{variable}')
    plt.show()
log_transformed_data = cont_all_df.copy()
for column in cont_all_df.columns:
    if column != 'Hospital Mortality':
        # log(x + 1) keeps zero values finite
        log_transformed_data[column] = np.log(cont_all_df[column] + 1)
print(log_transformed_data.head())
Hospital Mortality Age Max Heart Rate Min Heart Rate \
0 0 4.356709 5.129899 4.330733
1 1 3.761200 4.718499 4.418841
2 1 4.290459 4.663439 4.276666
9 1 4.290459 4.007333 3.891820
13 0 4.343805 4.867534 4.521789
Mean Heart Rate Max MAP Min MAP Mean MAP Max Systolic Pressure \
0 4.725490 5.560682 3.713572 4.339808 5.384495
1 4.537961 4.890349 4.219508 4.603669 5.384495
2 4.463936 4.709530 4.304065 4.490881 5.111988
9 3.955672 4.605170 4.158883 4.393499 4.990433
13 4.668946 4.488636 3.995137 4.304065 4.744932
Min Systolic Pressure Mean Systolic Pressure Max Diastolic Pressure \
0 4.174387 4.644006 4.317488
1 4.672829 5.077515 4.682131
2 4.595120 4.861583 4.382027
9 4.454347 4.746269 4.317488
13 4.219508 4.557030 4.343805
Min Diastolic Pressure Mean Diastolic Pressure Max Temperature \
0 3.367296 4.038127 3.653252
1 3.988984 4.388568 3.660709
2 4.025352 4.187922 3.654978
9 3.891820 4.109612 3.639047
13 3.663562 4.148675 3.660709
Min Temperature Mean Temperature Max Lactate Min Lactate Mean Lactate \
0 3.616309 3.637662 2.282382 1.131402 1.769855
1 3.597312 3.638885 1.308333 1.064711 1.202972
2 3.597312 3.627434 2.778819 1.098612 2.363680
9 3.633191 3.636489 0.916291 0.875469 0.896088
13 3.572658 3.624933 1.335001 1.335001 1.335001
Max pH Min pH Mean pH Max Glucose Min Glucose Mean Glucose \
0 2.150599 2.111425 2.123458 5.703782 4.521789 5.420092
1 2.131797 2.127041 2.129421 5.187386 4.867534 5.041811
2 2.122262 2.081938 2.104134 5.257495 4.488636 5.129899
9 2.136531 2.135349 2.136531 4.762174 4.624973 4.700480
13 2.118662 2.118662 2.118662 4.927254 4.553877 4.757891
Max WBC Min WBC Mean WBC Max BUN Min BUN Mean BUN \
0 3.234749 2.509599 2.904713 3.988984 3.737670 3.823192
1 2.687847 2.140066 2.451005 2.890372 2.833213 2.862201
2 2.240710 2.174752 2.208274 3.688879 3.367296 3.540959
9 2.066863 2.066863 2.066863 2.639057 2.484907 2.564949
13 2.954910 2.954910 2.954910 4.025352 3.761200 3.901973
Max Creatinine Min Creatinine Mean Creatinine Max Hemoglobin \
0 1.435085 1.223775 1.294727 2.624669
1 0.875469 0.788457 0.832909 2.797281
2 0.993252 0.832909 0.916291 2.660260
9 0.641854 0.530628 0.587787 2.451005
13 1.280934 1.029619 1.163151 2.602690
Min Hemoglobin Mean Hemoglobin
0 2.174752 2.401525
1 2.631889 2.714695
2 2.174752 2.418589
9 2.451005 2.451005
13 2.602690 2.602690
Data distribution after log transformation. Normality improves for some variables, but the rest remain skewed, so we will use the Mann-Whitney U test for most of the continuous variables.
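The Mann-Whitney U test compares two groups by rank and does not assume normality. A minimal sketch with `scipy.stats.mannwhitneyu` follows, using synthetic skewed values for the two mortality groups. The group sizes and distribution parameters are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic right-skewed samples standing in for one continuous variable
rng = np.random.default_rng(0)
survived = rng.lognormal(mean=0.5, sigma=0.6, size=400)
died = rng.lognormal(mean=0.9, sigma=0.6, size=100)

# Two-sided test: are values in one group systematically higher or lower than in the other?
stat, p = mannwhitneyu(survived, died, alternative='two-sided')
print(f"U = {stat:.1f}, p = {p:.3g}")
```

In the notebook, the two samples would be a column of `df` split by `Hospital Mortality == 0` and `== 1`.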
Visualization after log transformation method¶
for column in log_transformed_data.columns:
    if column != 'Hospital Mortality':
        plot_histogram_qqplot(log_transformed_data, column)
Shapiro Wilk Test After Log Transformation¶
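The Shapiro-Wilk test was applied again to the log-transformed data. It still rejects normality for every variable, even though visual inspection suggests that several distributions are now close to normal. At this sample size the test is sensitive to very small deviations, so the two views do not contradict each other.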
# Perform Shapiro-Wilk test on each column in log_transformed_data
# columns[1:] skips the first column, Hospital Mortality
for column in log_transformed_data.columns[1:]:
    stat, p = stats.shapiro(log_transformed_data[column])
    print(f"Shapiro-Wilk test for {column}:")
    print(f" Statistic: {stat}")
    print(f" p-value: {p}")
    print("")
    if p < 0.05:
        print(f"The distribution of {column} is significantly different from normal.")
    else:
        print(f"The distribution of {column} is not significantly different from normal.")
    print("---------------------------------------------------------------------")
| Variable | Statistic | p-value | Result |
|---|---|---|---|
| Age | 0.8753691911697388 | 0.0 | significantly different from normal |
| Max Heart Rate | 0.9981642365455627 | 4.8528583929119407e-11 | significantly different from normal |
| Min Heart Rate | 0.8807404041290283 | 0.0 | significantly different from normal |
| Mean Heart Rate | 0.9987273216247559 | 2.0667322075951233e-08 | significantly different from normal |
| Max MAP | 0.8827296495437622 | 0.0 | significantly different from normal |
| Min MAP | 0.8855717778205872 | 0.0 | significantly different from normal |
| Mean MAP | 0.9907329678535461 | 1.228228599741419e-27 | significantly different from normal |
| Max Systolic Pressure | 0.9919484257698059 | 7.146964553654495e-26 | significantly different from normal |
| Min Systolic Pressure | 0.8003720641136169 | 0.0 | significantly different from normal |
| Mean Systolic Pressure | 0.989392876625061 | 2.166625597062738e-29 | significantly different from normal |
| Max Diastolic Pressure | 0.980971097946167 | 1.232273283038572e-37 | significantly different from normal |
| Min Diastolic Pressure | 0.9163969159126282 | 0.0 | significantly different from normal |
| Mean Diastolic Pressure | 0.9966349005699158 | 2.676607172664644e-16 | significantly different from normal |
| Max Temperature | 0.9670045375823975 | 0.0 | significantly different from normal |
| Min Temperature | 0.898748517036438 | 0.0 | significantly different from normal |
| Mean Temperature | 0.9532508850097656 | 0.0 | significantly different from normal |
| Max Lactate | 0.9504865407943726 | 0.0 | significantly different from normal |
| Min Lactate | 0.9007787108421326 | 0.0 | significantly different from normal |
| Mean Lactate | 0.9346543550491333 | 0.0 | significantly different from normal |
| Max pH | 0.968603253364563 | 1.401298464324817e-45 | significantly different from normal |
| Min pH | 0.9329463243484497 | 0.0 | significantly different from normal |
| Mean pH | 0.9446377158164978 | 0.0 | significantly different from normal |
| Max Glucose | 0.9644086956977844 | 0.0 | significantly different from normal |
| Min Glucose | 0.9811251759529114 | 1.6319274165958327e-37 | significantly different from normal |
| Mean Glucose | 0.961662769317627 | 0.0 | significantly different from normal |
| Max WBC | 0.9587361216545105 | 0.0 | significantly different from normal |
| Min WBC | 0.9652097225189209 | 0.0 | significantly different from normal |
| Mean WBC | 0.956272304058075 | 0.0 | significantly different from normal |
| Max BUN | 0.9715502858161926 | 6.305843089461677e-44 | significantly different from normal |
| Min BUN | 0.9770194888114929 | 1.5867182771242769e-40 | significantly different from normal |
| Mean BUN | 0.9725632071495056 | 2.4242463432819335e-43 | significantly different from normal |
| Max Creatinine | 0.8170948028564453 | 0.0 | significantly different from normal |
| Min Creatinine | 0.800369918346405 | 0.0 | significantly different from normal |
| Mean Creatinine | 0.8064248561859131 | 0.0 | significantly different from normal |
| Max Hemoglobin | 0.9962196350097656 | 1.9383733575538924e-17 | significantly different from normal |
| Min Hemoglobin | 0.9910911917686462 | 3.893604805505054e-27 | significantly different from normal |
| Mean Hemoglobin | 0.9963561296463013 | 4.490599763789102e-17 | significantly different from normal |
/usr/local/lib/python3.10/dist-packages/scipy/stats/_morestats.py:1882: UserWarning: p-value may not be accurate for N > 5000.
warnings.warn("p-value may not be accurate for N > 5000.")
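L2 normalization divides each column by its Euclidean norm, which only rescales the values. The shape of each distribution is unchanged, so L2 normalization cannot make the data normal for statistical tests.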
X = cont_all_df.to_numpy()
norms = np.linalg.norm(X, axis=0)
normalized_data = X / norms
normalized_df = pd.DataFrame(normalized_data, columns=cont_all_df.columns)
normalized_df.head()
| | Hospital Mortality | Age | Max Heart Rate | Min Heart Rate | Mean Heart Rate | Max MAP | Min MAP | Mean MAP | Max Systolic Pressure | Min Systolic Pressure | Mean Systolic Pressure | Max Diastolic Pressure | Min Diastolic Pressure | Mean Diastolic Pressure | Max Temperature | Min Temperature | Mean Temperature | Max Lactate | Min Lactate | Mean Lactate | Max pH | Min pH | Mean pH | Max Glucose | Min Glucose | Mean Glucose | Max WBC | Min WBC | Mean WBC | Max BUN | Min BUN | Mean BUN | Max Creatinine | Min Creatinine | Mean Creatinine | Max Hemoglobin | Min Hemoglobin | Mean Hemoglobin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.010717 | 0.013675 | 0.009115 | 0.011144 | 0.020411 | 0.006102 | 0.008584 | 0.012558 | 0.006451 | 0.007799 | 0.007731 | 0.005592 | 0.008140 | 0.008910 | 0.008987 | 0.008956 | 0.018049 | 0.008220 | 0.013812 | 0.009139 | 0.008911 | 0.008942 | 0.012649 | 0.007068 | 0.013003 | 0.011270 | 0.007406 | 0.009490 | 0.014316 | 0.013204 | 0.013219 | 0.014234 | 0.013160 | 0.013121 | 0.009110 | 0.006999 | 0.008081 |
| 1 | 0.021527 | 0.005846 | 0.009036 | 0.009965 | 0.009222 | 0.010402 | 0.010220 | 0.011210 | 0.012558 | 0.010684 | 0.012072 | 0.011178 | 0.010584 | 0.011617 | 0.008979 | 0.008813 | 0.008967 | 0.005538 | 0.007437 | 0.006608 | 0.008946 | 0.009071 | 0.009003 | 0.007530 | 0.010019 | 0.008889 | 0.006328 | 0.004916 | 0.005828 | 0.004592 | 0.005153 | 0.004874 | 0.006227 | 0.006580 | 0.006437 | 0.010960 | 0.011576 | 0.011349 |
| 2 | 0.021527 | 0.010021 | 0.008547 | 0.008629 | 0.008557 | 0.008669 | 0.011135 | 0.010003 | 0.009549 | 0.009878 | 0.009713 | 0.008253 | 0.010984 | 0.009479 | 0.008926 | 0.008813 | 0.008863 | 0.030971 | 0.007828 | 0.027312 | 0.008850 | 0.008617 | 0.008748 | 0.008080 | 0.006835 | 0.009713 | 0.003880 | 0.005112 | 0.004453 | 0.010535 | 0.009017 | 0.009896 | 0.007562 | 0.007128 | 0.007427 | 0.009466 | 0.006999 | 0.008234 |
| 3 | 0.021527 | 0.010021 | 0.004396 | 0.005833 | 0.005107 | 0.007802 | 0.009610 | 0.009064 | 0.008449 | 0.008567 | 0.008647 | 0.007731 | 0.009586 | 0.008754 | 0.008781 | 0.009144 | 0.008945 | 0.003077 | 0.005480 | 0.004112 | 0.008995 | 0.009157 | 0.009076 | 0.004907 | 0.007844 | 0.006302 | 0.003187 | 0.004522 | 0.003794 | 0.003512 | 0.003542 | 0.003545 | 0.004003 | 0.003838 | 0.003961 | 0.007544 | 0.009512 | 0.008532 |
| 4 | 0.000000 | 0.010578 | 0.010501 | 0.011059 | 0.010526 | 0.006935 | 0.008136 | 0.008279 | 0.006597 | 0.006753 | 0.007143 | 0.007940 | 0.007589 | 0.009108 | 0.008979 | 0.008592 | 0.008840 | 0.005743 | 0.010959 | 0.007941 | 0.008814 | 0.008985 | 0.008893 | 0.005796 | 0.007301 | 0.006678 | 0.008406 | 0.011929 | 0.010006 | 0.014857 | 0.013526 | 0.014326 | 0.011565 | 0.009870 | 0.010893 | 0.008896 | 0.011217 | 0.010061 |
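To illustrate why dividing a column by its L2 norm cannot help with normality, the sketch below applies the same scaling to a synthetic right-skewed sample. The lognormal data and the variable names are illustrative assumptions, not the ICU dataset. Skewness and the Shapiro-Wilk statistic do not depend on scale, so both should come out essentially unchanged.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (assumption: stands in for one lab-value column)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Same operation as X / np.linalg.norm(X, axis=0), applied to a single column
x_l2 = x / np.linalg.norm(x)

# Shape statistics before and after scaling
skew_raw, skew_l2 = stats.skew(x), stats.skew(x_l2)
w_raw, _ = stats.shapiro(x)
w_l2, _ = stats.shapiro(x_l2)

print(f"skewness:  raw={skew_raw:.4f}, L2-scaled={skew_l2:.4f}")
print(f"Shapiro W: raw={w_raw:.4f}, L2-scaled={w_l2:.4f}")
```

The same argument applies to any purely linear rescaling, which is why a shape-changing transform such as the log or Box-Cox is needed instead.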
Visualization after L2 Normalization¶
for column in normalized_df.columns:
    if column != 'Hospital Mortality':
        plot_histogram_qqplot(normalized_df, column)
# Apply Box-Cox transformation to all numeric columns
box_cox_data = cont_all_df.copy()  # Work on a copy of the original DataFrame
# Define a function that applies Box-Cox to one column (a Series)
def apply_boxcox(x):
    # Skip non-numeric columns
    if not pd.api.types.is_numeric_dtype(x):
        return x
    # Box-Cox requires strictly positive values
    if (x <= 0).any():
        # Shift the column so that its minimum becomes a small positive constant
        x = x + np.abs(x.min()) + 1e-6
    # Apply Box-Cox transformation (lambda is estimated by maximum likelihood)
    transformed_data, _ = boxcox(x)
    return pd.Series(transformed_data, index=x.index)
# Apply the function column by column; applymap would pass single values, not columns
box_cox_df = box_cox_data.apply(apply_boxcox)
# Now box_cox_df contains Box-Cox transformed values for all numeric columns
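As a minimal sketch of what a column-wise Box-Cox transform does, the example below runs `scipy.stats.boxcox` on two synthetic, strictly positive, right-skewed columns and compares skewness before and after. The frame `demo_df`, its values, and the helper `boxcox_column` are hypothetical and only borrow column names from this notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical stand-in for a few skewed, strictly positive lab values
demo_df = pd.DataFrame({
    "Max Lactate": rng.lognormal(0.5, 0.6, size=400),
    "Max Creatinine": rng.lognormal(0.0, 0.8, size=400),
})

def boxcox_column(col: pd.Series) -> pd.Series:
    """Box-Cox transform one column and keep its index."""
    transformed, _lmbda = stats.boxcox(col.to_numpy())
    return pd.Series(transformed, index=col.index)

# DataFrame.apply passes each column as a Series
demo_bc = demo_df.apply(boxcox_column)

skew_before = demo_df.skew()
skew_after = demo_bc.skew()
print(pd.DataFrame({"skew_before": skew_before, "skew_after": skew_after}))
```

Reduced skewness does not guarantee normality, so the histogram and Q-Q plots are still worth checking after the transform.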
Visualization After BoxCox Method¶
for column in box_cox_df.columns:
    if column != 'Hospital Mortality':
        plot_histogram_qqplot(box_cox_df, column)
scaler = MinMaxScaler()
# Apply Min-Max scaling to all columns
scaled_data = scaler.fit_transform(cont_all_df)
# Convert the scaled data array back to a DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=cont_all_df.columns)
Visualization After Min-Max Method¶
# Iterate through columns and create plots
for column in scaled_df.columns:
    if column != 'Hospital Mortality':  # Skip the target column
        plot_histogram_qqplot(scaled_df, column)
Statistical Tests¶
# Create an empty dictionary to store results assuming not normal distribution
result_dict = {'Variable': [], 'Data_Type': [], 'Type_of_Test': [], 'P-value': []}
# Create an empty dictionary to store results assuming normal distribution
# result_dict_n = {'Variable': [], 'Data_Type': [], 'Type_of_Test': [], 'P-value': []}
When we use the dataset with the missing values removed, the majority of the t-test results are statistically significant. However, since the data are not normally distributed, we will use a nonparametric test.
Assumptions of the independent t-test
- Independence of the observations. Each subject should belong to only one group, and there should be no relationship between the observations in each group.
- No significant outliers in either group.
- Normality. The data for each group should be approximately normally distributed. (For large samples, the Central Limit Theorem makes the test more robust to departures from normality.)
- Homogeneity of variances. The variance of the outcome variable should be equal in each group. Recall that Welch's t-test does not make this assumption.
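The effect of the equal-variance assumption can be sketched with `scipy.stats.ttest_ind`, which runs Student's t-test when `equal_var=True` and Welch's t-test when `equal_var=False`. The two groups below are synthetic, with deliberately unequal sizes and variances; they are an assumption for illustration, not data from this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic groups with unequal sizes and unequal variances (illustrative only)
group_a = rng.normal(loc=80.0, scale=10.0, size=1000)
group_b = rng.normal(loc=82.0, scale=25.0, size=200)

# Student's t-test pools the two variances
t_student, p_student = stats.ttest_ind(group_a, group_b, equal_var=True)
# Welch's t-test keeps a separate variance estimate per group
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t={t_student:.3f}, p={p_student:.4f}")
print(f"Welch:   t={t_welch:.3f}, p={p_welch:.4f}")
```

When the smaller group has the larger variance, as here, the pooled test tends to understate the standard error, which is one reason to check variance homogeneity first.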
Levene Test for homogeneity¶
- H0: the variances of groups are equal
- H1: the variances of groups are NOT equal
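All feature columns return a p-value below 0.05, so we reject H0 and conclude that the variances of the groups are NOT equal. (The target column itself returns NaN because each of its groups is constant.)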
# Conduct Levene's test on cont_all_df to test homogeneity of variances.
# Extract the numeric columns
# (this includes 'Hospital Mortality' itself, whose two groups are constant, so its row is NaN)
numeric_cols = cont_all_df.select_dtypes(include=['int64', 'float64']).columns
# Initialize an empty list to store results
levene_results = []
# Loop through each numeric column
for col in numeric_cols:
    # Perform Levene's test between survivors (0) and non-survivors (1)
    levene_statistic, levene_pvalue = stats.levene(
        cont_all_df[col][cont_all_df['Hospital Mortality'] == 0],
        cont_all_df[col][cont_all_df['Hospital Mortality'] == 1])
    # Store the results
    if levene_pvalue > 0.05:
        levene_results.append({'Variable': col, 'Levene Statistic': levene_statistic, 'Levene p-value': levene_pvalue, 'Homogeneity': 'Yes'})
    else:
        levene_results.append({'Variable': col, 'Levene Statistic': levene_statistic, 'Levene p-value': levene_pvalue, 'Homogeneity': 'No'})
# Create a DataFrame to display the results
levene_df = pd.DataFrame(levene_results)
# Print the DataFrame
print(levene_df.to_string())
| | Variable | Levene Statistic | Levene p-value | Homogeneity |
|---|---|---|---|---|
| 0 | Hospital Mortality | NaN | NaN | No |
| 1 | Age | 9.766476 | 1.781301e-03 | No |
| 2 | Max Heart Rate | 228.954740 | 2.867679e-51 | No |
| 3 | Min Heart Rate | 381.172645 | 1.214381e-83 | No |
| 4 | Mean Heart Rate | 242.538863 | 3.550201e-54 | No |
| 5 | Max MAP | 51.127977 | 9.135754e-13 | No |
| 6 | Min MAP | 280.885248 | 2.308827e-62 | No |
| 7 | Mean MAP | 128.331355 | 1.324610e-29 | No |
| 8 | Max Systolic Pressure | 151.111596 | 1.568772e-34 | No |
| 9 | Min Systolic Pressure | 309.132049 | 2.244147e-68 | No |
| 10 | Mean Systolic Pressure | 225.880693 | 1.306078e-50 | No |
| 11 | Max Diastolic Pressure | 67.658065 | 2.136216e-16 | No |
| 12 | Min Diastolic Pressure | 167.034840 | 5.756374e-38 | No |
| 13 | Mean Diastolic Pressure | 77.281635 | 1.674975e-18 | No |
| 14 | Max Temperature | 586.364446 | 1.255342e-126 | No |
| 15 | Min Temperature | 329.538161 | 1.039833e-72 | No |
| 16 | Mean Temperature | 644.300648 | 1.228729e-138 | No |
| 17 | Max Lactate | 850.401647 | 6.381367e-181 | No |
| 18 | Min Lactate | 820.758774 | 7.002810e-175 | No |
| 19 | Mean Lactate | 1041.349674 | 1.656900e-219 | No |
| 20 | Max pH | 364.313198 | 4.453588e-80 | No |
| 21 | Min pH | 940.224449 | 3.843229e-199 | No |
| 22 | Mean pH | 1061.326595 | 1.638447e-223 | No |
| 23 | Max Glucose | 241.999709 | 4.629889e-54 | No |
| 24 | Min Glucose | 361.134361 | 2.095750e-79 | No |
| 25 | Mean Glucose | 393.565497 | 2.934593e-86 | No |
| 26 | Max WBC | 146.193383 | 1.811252e-33 | No |
| 27 | Min WBC | 224.203857 | 2.986887e-50 | No |
| 28 | Mean WBC | 190.961710 | 4.069839e-43 | No |
| 29 | Max BUN | 373.597778 | 4.843634e-82 | No |
| 30 | Min BUN | 453.030987 | 8.886399e-99 | No |
| 31 | Mean BUN | 417.703660 | 2.390378e-91 | No |
| 32 | Max Creatinine | 149.299426 | 3.863182e-34 | No |
| 33 | Min Creatinine | 167.901130 | 3.744817e-38 | No |
| 34 | Mean Creatinine | 158.743041 | 3.534169e-36 | No |
| 35 | Max Hemoglobin | 59.074669 | 1.631323e-14 | No |
| 36 | Min Hemoglobin | 15.581030 | 7.947696e-05 | No |
| 37 | Mean Hemoglobin | 39.784147 | 2.932115e-10 | No |
/usr/local/lib/python3.10/dist-packages/scipy/stats/_morestats.py:3189: RuntimeWarning: invalid value encountered in scalar divide
W = numer / denom
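The NaN row and the RuntimeWarning arise because the target column is compared against itself. One possible way to avoid this is to drop the target from the list of tested columns, as in the sketch below. The frame `demo_df` and its two features are synthetic assumptions used only to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
n = 600
# Synthetic frame (assumption): a binary target plus two continuous features
target = rng.integers(0, 2, size=n)
demo_df = pd.DataFrame({
    "Hospital Mortality": target,
    "Feature A": rng.normal(0, 1 + 2 * target),  # spread depends on the group
    "Feature B": rng.normal(0, 1, size=n),       # same spread in both groups
})

# Test every column except the target itself
feature_cols = demo_df.columns.drop("Hospital Mortality")
is_dead = demo_df["Hospital Mortality"] == 1

rows = []
for col in feature_cols:
    stat, p = stats.levene(demo_df.loc[~is_dead, col], demo_df.loc[is_dead, col])
    rows.append({"Variable": col, "Levene Statistic": stat,
                 "Levene p-value": p, "Homogeneity": "Yes" if p > 0.05 else "No"})

levene_demo = pd.DataFrame(rows)
print(levene_demo.to_string(index=False))
```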
Welch's t-test¶
tranformed_log = log_transformed_data.copy()
welch_data = tranformed_log[['Hospital Mortality', 'Max Heart Rate', 'Mean Heart Rate', 'Mean MAP', 'Mean Systolic Pressure', 'Mean Diastolic Pressure', 'Mean BUN', 'Max Hemoglobin', 'Mean Hemoglobin']]
tranformed_log['Hospital Mortality'].value_counts()
0    10331
1     2158
Name: Hospital Mortality, dtype: int64
# Welch's t-test on the log-transformed subset: survivors (0) vs. non-survivors (1)
for column in welch_data.columns[1:]:
    t_statistic, p_value = stats.ttest_ind(
        welch_data[column][welch_data['Hospital Mortality'] == 0],
        welch_data[column][welch_data['Hospital Mortality'] == 1],
        equal_var=False)
    print(column)
    # Print or use the results as needed
    print(f"Welch's T-test for {column}:")
    print(f" T-statistic: {t_statistic}")
    print(f" P-value: {p_value}")
    print("")
    if p_value < 0.05:
        print(f"The difference in {column} between survivors and non-survivors is statistically significant.")
        print("---------------------------------------------------------------------")
    else:
        print(f"There is no significant difference in {column} between survivors and non-survivors.")
        print("---------------------------------------------------------------------")
    # Append results to the dictionary
    result_dict['Variable'].append(column)
    result_dict['Data_Type'].append('Continuous')
    result_dict['Type_of_Test'].append('Welch\'s T-test')
    result_dict['P-value'].append(p_value)
Max Heart Rate
Welch's T-test for Max Heart Rate:
 T-statistic: -14.425352999038484
 P-value: 1.5137029803818268e-45
The difference in Max Heart Rate between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Heart Rate
Welch's T-test for Mean Heart Rate:
 T-statistic: -9.820534110448351
 P-value: 2.1132933705136722e-22
The difference in Mean Heart Rate between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean MAP
Welch's T-test for Mean MAP:
 T-statistic: 9.963466996165206
 P-value: 5.409360708044105e-23
The difference in Mean MAP between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Systolic Pressure
Welch's T-test for Mean Systolic Pressure:
 T-statistic: 11.317210031572678
 P-value: 4.935026168815372e-29
The difference in Mean Systolic Pressure between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Diastolic Pressure
Welch's T-test for Mean Diastolic Pressure:
 T-statistic: 8.723910851015027
 P-value: 4.528457004895649e-18
The difference in Mean Diastolic Pressure between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean BUN
Welch's T-test for Mean BUN:
 T-statistic: -27.18572952625232
 P-value: 4.778132353060818e-145
The difference in Mean BUN between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max Hemoglobin
Welch's T-test for Max Hemoglobin:
 T-statistic: 9.42725880042387
 P-value: 8.444630921832603e-21
The difference in Max Hemoglobin between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Hemoglobin
Welch's T-test for Mean Hemoglobin:
 T-statistic: 0.8239296708080953
 P-value: 0.41004708014899816
There is no significant difference in Mean Hemoglobin between survivors and non-survivors.
---------------------------------------------------------------------
Mann-Whitney U test¶
columns_to_exclude = ['Mean Heart Rate', 'Max Heart Rate','Mean MAP', 'Mean Systolic Pressure', 'Mean Diastolic Pressure', 'Mean BUN', 'Max Hemoglobin', 'Mean Hemoglobin']
# Dropping the specified columns
mann_cont_data = cont_all_df.drop(columns=columns_to_exclude)
# Separate data into two groups based on target
survivors = mann_cont_data[mann_cont_data['Hospital Mortality'] == 0]
non_survivors = mann_cont_data[mann_cont_data['Hospital Mortality'] == 1]
# Perform Mann-Whitney U test for each numerical column
for column in mann_cont_data.columns[1:]:
    stat, p = stats.mannwhitneyu(survivors[column], non_survivors[column])
    print(column)
    # Print or use the results as needed
    print(f"Mann-Whitney U test for {column}:")
    print(f" Statistic: {stat}")
    print(f" P-value: {p}")
    print("")
    if p < 0.05:
        print(f"The difference in {column} between survivors and non-survivors is statistically significant.")
        print("---------------------------------------------------------------------")
    else:
        print(f"There is no significant difference in {column} between survivors and non-survivors.")
        print("---------------------------------------------------------------------")
    # Append results to the dictionary
    result_dict['Variable'].append(column)
    result_dict['Data_Type'].append('Continuous')
    result_dict['Type_of_Test'].append('Mann-Whitney U')
    result_dict['P-value'].append(p)
Age Mann-Whitney U test for Age: Statistic: 8525675.5 P-value: 2.1525320528626538e-66 The difference in Age between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Min Heart Rate Mann-Whitney U test for Min Heart Rate: Statistic: 11076044.0 P-value: 0.6405817553911917 There is no significant difference in Min Heart Rate between survivors and non-survivors. --------------------------------------------------------------------- Max MAP Mann-Whitney U test for Max MAP: Statistic: 11096635.5 P-value: 0.7401547602044709 There is no significant difference in Max MAP between survivors and non-survivors. --------------------------------------------------------------------- Min MAP Mann-Whitney U test for Min MAP: Statistic: 14278470.0 P-value: 5.902742296218549e-94 The difference in Min MAP between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Max Systolic Pressure Mann-Whitney U test for Max Systolic Pressure: Statistic: 11397950.5 P-value: 0.09964242690167496 There is no significant difference in Max Systolic Pressure between survivors and non-survivors. --------------------------------------------------------------------- Min Systolic Pressure Mann-Whitney U test for Min Systolic Pressure: Statistic: 14238452.0 P-value: 1.3500531808846305e-91 The difference in Min Systolic Pressure between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Max Diastolic Pressure Mann-Whitney U test for Max Diastolic Pressure: Statistic: 10874113.0 P-value: 0.07300896056630023 There is no significant difference in Max Diastolic Pressure between survivors and non-survivors. 
--------------------------------------------------------------------- Min Diastolic Pressure Mann-Whitney U test for Min Diastolic Pressure: Statistic: 14159804.0 P-value: 3.8903901136352047e-87 The difference in Min Diastolic Pressure between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Max Temperature Mann-Whitney U test for Max Temperature: Statistic: 11815671.5 P-value: 1.1390806265114146e-05 The difference in Max Temperature between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Min Temperature Mann-Whitney U test for Min Temperature: Statistic: 12427119.5 P-value: 4.344242710802271e-17 The difference in Min Temperature between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Mean Temperature Mann-Whitney U test for Mean Temperature: Statistic: 12364791.5 P-value: 1.3125499468833547e-15 The difference in Mean Temperature between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Max Lactate Mann-Whitney U test for Max Lactate: Statistic: 8190966.0 P-value: 6.177760645508349e-84 The difference in Max Lactate between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Min Lactate Mann-Whitney U test for Min Lactate: Statistic: 7130572.5 P-value: 1.4044905844472804e-153 The difference in Min Lactate between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Mean Lactate Mann-Whitney U test for Mean Lactate: Statistic: 7683017.5 P-value: 1.694887046403263e-114 The difference in Mean Lactate between survivors and non-survivors is statistically significant. 
--------------------------------------------------------------------- Max pH Mann-Whitney U test for Max pH: Statistic: 12850534.0 P-value: 4.311982639570609e-29 The difference in Max pH between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Min pH Mann-Whitney U test for Min pH: Statistic: 13348631.0 P-value: 2.1446956365998007e-47 The difference in Min pH between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Mean pH Mann-Whitney U test for Mean pH: Statistic: 13346745.5 P-value: 2.205225301421819e-47 The difference in Mean pH between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Max Glucose Mann-Whitney U test for Max Glucose: Statistic: 9378364.5 P-value: 3.5942111872237883e-31 The difference in Max Glucose between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Min Glucose Mann-Whitney U test for Min Glucose: Statistic: 9174274.5 P-value: 2.2841251393337134e-38 The difference in Min Glucose between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Mean Glucose Mann-Whitney U test for Mean Glucose: Statistic: 8716584.0 P-value: 2.594076017013178e-57 The difference in Mean Glucose between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Max WBC Mann-Whitney U test for Max WBC: Statistic: 10015821.0 P-value: 1.1117900361048767e-13 The difference in Max WBC between survivors and non-survivors is statistically significant. 
--------------------------------------------------------------------- Min WBC Mann-Whitney U test for Min WBC: Statistic: 9939952.5 P-value: 2.281025809249482e-15 The difference in Min WBC between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Mean WBC Mann-Whitney U test for Mean WBC: Statistic: 9949532.5 P-value: 3.781638033820371e-15 The difference in Mean WBC between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Max BUN Mann-Whitney U test for Max BUN: Statistic: 6910567.0 P-value: 2.0237352811575977e-170 The difference in Max BUN between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Min BUN Mann-Whitney U test for Min BUN: Statistic: 7025303.0 P-value: 1.7651631567022414e-161 The difference in Min BUN between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Max Creatinine Mann-Whitney U test for Max Creatinine: Statistic: 7640747.0 P-value: 6.435699951912311e-118 The difference in Max Creatinine between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Min Creatinine Mann-Whitney U test for Min Creatinine: Statistic: 8025284.5 P-value: 3.9910953959453173e-94 The difference in Min Creatinine between survivors and non-survivors is statistically significant. --------------------------------------------------------------------- Mean Creatinine Mann-Whitney U test for Mean Creatinine: Statistic: 7753167.5 P-value: 4.6757947488289196e-110 The difference in Mean Creatinine between survivors and non-survivors is statistically significant. 
--------------------------------------------------------------------- Min Hemoglobin Mann-Whitney U test for Min Hemoglobin: Statistic: 10611817.0 P-value: 0.0004403852646089298 The difference in Min Hemoglobin between survivors and non-survivors is statistically significant. ---------------------------------------------------------------------
The following Mann-Whitney U test is conducted on the variables that became approximately normally distributed after the log transformation. We created this subset to compare its results with the Welch's t-test results.
subset_welch = cont_all_df.copy()
MW_comparison = subset_welch[['Hospital Mortality', 'Max Heart Rate', 'Mean Heart Rate', 'Mean MAP', 'Mean Systolic Pressure', 'Mean Diastolic Pressure', 'Mean BUN', 'Max Hemoglobin', 'Mean Hemoglobin']]
# Separate data into two groups based on target
survivors = MW_comparison[MW_comparison['Hospital Mortality'] == 0]
non_survivors = MW_comparison[MW_comparison['Hospital Mortality'] == 1]
# Perform Mann-Whitney U test for each numerical column
for column in MW_comparison.columns[1:]:
    stat, p = stats.mannwhitneyu(survivors[column], non_survivors[column])
    print(column)
    # Print or use the results as needed
    print(f"Mann-Whitney U test for {column}:")
    print(f" Statistic: {stat}")
    print(f" P-value: {p}")
    print("")
    if p < 0.05:
        print(f"The difference in {column} between survivors and non-survivors is statistically significant.")
        print("---------------------------------------------------------------------")
    else:
        print(f"There is no significant difference in {column} between survivors and non-survivors.")
        print("---------------------------------------------------------------------")
Max Heart Rate
Mann-Whitney U test for Max Heart Rate:
 Statistic: 8640638.5
 P-value: 7.524132792815273e-61
The difference in Max Heart Rate between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Heart Rate
Mann-Whitney U test for Mean Heart Rate:
 Statistic: 9409815.0
 P-value: 3.947274866302104e-30
The difference in Mean Heart Rate between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean MAP
Mann-Whitney U test for Mean MAP:
 Statistic: 12733640.5
 P-value: 2.1228661641088843e-25
The difference in Mean MAP between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Systolic Pressure
Mann-Whitney U test for Mean Systolic Pressure:
 Statistic: 13073048.5
 P-value: 1.224837336030013e-36
The difference in Mean Systolic Pressure between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Diastolic Pressure
Mann-Whitney U test for Mean Diastolic Pressure:
 Statistic: 12469046.0
 P-value: 4.0347388451056675e-18
The difference in Mean Diastolic Pressure between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean BUN
Mann-Whitney U test for Mean BUN:
 Statistic: 6923118.0
 P-value: 2.922259721412044e-169
The difference in Mean BUN between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max Hemoglobin
Mann-Whitney U test for Max Hemoglobin:
 Statistic: 12656776.0
 P-value: 3.714891731259355e-23
The difference in Max Hemoglobin between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Hemoglobin
Mann-Whitney U test for Mean Hemoglobin:
 Statistic: 11249336.0
 P-value: 0.5023332041707467
There is no significant difference in Mean Hemoglobin between survivors and non-survivors.
---------------------------------------------------------------------
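When comparing the two tests, it helps to remember that the Mann-Whitney U test uses only ranks, so a strictly increasing transform such as the log leaves its result unchanged, whereas Welch's t-test depends on the scale of the data. The sketch below shows this on synthetic positive samples; the sample sizes, parameters, and variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Synthetic, strictly positive samples (illustrative only)
survivors_demo = rng.lognormal(mean=3.0, sigma=0.5, size=800)
non_survivors_demo = rng.lognormal(mean=3.2, sigma=0.7, size=200)

# Mann-Whitney U on the raw scale and on the log scale
u_raw, p_raw = stats.mannwhitneyu(survivors_demo, non_survivors_demo)
u_log, p_log = stats.mannwhitneyu(np.log(survivors_demo), np.log(non_survivors_demo))

# Welch's t-test on the raw scale and on the log scale
_, p_welch_raw = stats.ttest_ind(survivors_demo, non_survivors_demo, equal_var=False)
_, p_welch_log = stats.ttest_ind(np.log(survivors_demo), np.log(non_survivors_demo),
                                 equal_var=False)

print(f"Mann-Whitney: U raw={u_raw}, U log={u_log}, p raw={p_raw:.3e}, p log={p_log:.3e}")
print(f"Welch:        p raw={p_welch_raw:.3e}, p log={p_welch_log:.3e}")
```

So the Mann-Whitney results on the untransformed columns can be compared directly with Welch's results on the log-transformed columns.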
Spearman Correlation¶
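This section computes Spearman correlations between pairs of continuous variables, separately for survivors and non-survivors. As a minimal sketch of one alternative way to enumerate the pairs, `itertools.combinations` visits each unordered pair exactly once, so mirrored (A, B) and (B, A) rows never appear. The frame `demo_df` and its values are synthetic assumptions; only the column names mimic this notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(5)
n = 300
base = rng.normal(size=n)
# Synthetic frame (assumption): two related columns and one unrelated column
demo_df = pd.DataFrame({
    "Hospital Mortality": rng.integers(0, 2, size=n),
    "Mean MAP": base + rng.normal(scale=0.3, size=n),
    "Mean Diastolic Pressure": base + rng.normal(scale=0.3, size=n),
    "Mean Lactate": rng.normal(size=n),
})

features = [c for c in demo_df.columns if c != "Hospital Mortality"]
groups = {
    "Survivors": demo_df[demo_df["Hospital Mortality"] == 0],
    "Non-Survivors": demo_df[demo_df["Hospital Mortality"] == 1],
}

records = []
for col1, col2 in combinations(features, 2):  # each unordered pair exactly once
    for label, frame in groups.items():
        rho, p = stats.spearmanr(frame[col1], frame[col2])
        records.append({"Variable_1": col1, "Variable_2": col2,
                        "Correlation": rho, "P_Value": p, "Group": label})

pairs_df = pd.DataFrame(records)
# Keep only strong correlations, positive or negative
strong = pairs_df[pairs_df["Correlation"].abs() >= 0.5]
print(strong)
```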
# Create an empty dictionary to store the correlation results
corr_results_cont = {'Variable_1': [], 'Variable_2': [], 'Test': [],'Correlation': [], 'P_Value': [], 'Group': []}
cols_to_exclude_s = [col for col in cont_all_df.columns if 'Min_x' in col or 'Max_x' in col]
spearman_matrix = cont_all_df.drop(cols_to_exclude_s, axis=1)
# Separate data into two groups based on target (done once, outside the loops)
survivors = spearman_matrix[spearman_matrix['Hospital Mortality'] == 0]
non_survivors = spearman_matrix[spearman_matrix['Hospital Mortality'] == 1]
# Iterate over the continuous variables, excluding 'Hospital Mortality'
# Note: most pairs are visited in both orders, (A, B) and (B, A)
for col1 in spearman_matrix.columns[1:-1]:
    for col2 in spearman_matrix.columns[2:]:
        if col1 != col2:
            # Calculate Spearman's correlation coefficient and p-value for each group
            corr_survivors, p_survivors = stats.spearmanr(survivors[col1], survivors[col2])
            corr_non_survivors, p_non_survivors = stats.spearmanr(non_survivors[col1], non_survivors[col2])
            # Store the results for survivors
            corr_results_cont['Variable_1'].append(col1)
            corr_results_cont['Variable_2'].append(col2)
            corr_results_cont['Correlation'].append(corr_survivors)
            corr_results_cont['P_Value'].append(p_survivors)
            corr_results_cont['Group'].append('Survivors')
            corr_results_cont['Test'].append('Spearman')
            # Store the results for non-survivors
            corr_results_cont['Variable_1'].append(col1)
            corr_results_cont['Variable_2'].append(col2)
            corr_results_cont['Correlation'].append(corr_non_survivors)
            corr_results_cont['P_Value'].append(p_non_survivors)
            corr_results_cont['Group'].append('Non-Survivors')
            corr_results_cont['Test'].append('Spearman')
# Create a DataFrame from the results
corr_df_cont = pd.DataFrame(corr_results_cont)
# Keep only strong correlations: from -1 to -0.5 and from 0.5 to 1
filtered_corr = corr_df_cont[(corr_df_cont['Correlation'] <= -0.5) | (corr_df_cont['Correlation'] >= 0.5)]
print(filtered_corr.shape)
# Sort the filtered results by Correlation in descending order
sorted_corr = filtered_corr.sort_values(by='Correlation', ascending=False)
sorted_corr['Correlation'] = sorted_corr['Correlation'].round(4)
# Use the second word of each name (e.g. 'Heart' in 'Max Heart Rate') to drop pairs from the same measurement
sorted_corr['Second_Word_Variable_1'] = sorted_corr['Variable_1'].str.split().str[1]
sorted_corr['Second_Word_Variable_2'] = sorted_corr['Variable_2'].str.split().str[1]
sorted_corr = sorted_corr[sorted_corr['Second_Word_Variable_1'] != sorted_corr['Second_Word_Variable_2']]
print(sorted_corr.shape)
# Drop the two helper columns
sorted_corr = sorted_corr.drop(['Second_Word_Variable_2','Second_Word_Variable_1'], axis=1)
sorted_corr.reset_index(drop=True, inplace=True)
# Keep only the odd-indexed rows; this halves the table but relies on the sort order
sorted_corr = sorted_corr[sorted_corr.index % 2 != 0]
print(sorted_corr.shape)
print(sorted_corr)
(204, 6)
(90, 8)
(45, 6)
Variable_1 Variable_2 Test Correlation \
1 Mean MAP Mean Diastolic Pressure Spearman 0.8582
3 Mean MAP Mean Diastolic Pressure Spearman 0.8569
5 Min Diastolic Pressure Min MAP Spearman 0.8189
7 Min Diastolic Pressure Min MAP Spearman 0.8148
9 Min Systolic Pressure Min MAP Spearman 0.7987
11 Mean MAP Mean Systolic Pressure Spearman 0.7556
13 Max MAP Max Diastolic Pressure Spearman 0.7452
15 Max Diastolic Pressure Max MAP Spearman 0.7362
17 Mean Creatinine Max BUN Spearman 0.7351
19 Mean Creatinine Mean BUN Spearman 0.7328
21 Min Creatinine Min BUN Spearman 0.7320
23 Mean BUN Min Creatinine Spearman 0.7314
25 Max Creatinine Max BUN Spearman 0.7298
27 Min Creatinine Max BUN Spearman 0.7186
29 Mean BUN Mean Creatinine Spearman 0.7161
31 Max Creatinine Mean BUN Spearman 0.7145
33 Min MAP Min Systolic Pressure Spearman 0.7143
35 Mean Creatinine Min BUN Spearman 0.7138
37 Max BUN Max Creatinine Spearman 0.7094
39 Min Creatinine Min BUN Spearman 0.7048
41 Max BUN Mean Creatinine Spearman 0.7046
43 Mean MAP Mean Systolic Pressure Spearman 0.7038
45 Min BUN Mean Creatinine Spearman 0.7038
47 Max Creatinine Mean BUN Spearman 0.7032
49 Mean BUN Min Creatinine Spearman 0.6957
51 Min BUN Max Creatinine Spearman 0.6833
53 Max Systolic Pressure Max MAP Spearman 0.6739
55 Max Creatinine Min BUN Spearman 0.6734
57 Max BUN Min Creatinine Spearman 0.6680
59 Max MAP Max Systolic Pressure Spearman 0.6467
61 Min Diastolic Pressure Min Systolic Pressure Spearman 0.6334
63 Min Diastolic Pressure Mean MAP Spearman 0.6235
65 Max Systolic Pressure Mean MAP Spearman 0.6105
67 Mean Diastolic Pressure Min MAP Spearman 0.6095
69 Max Diastolic Pressure Mean MAP Spearman 0.6088
71 Min Diastolic Pressure Mean MAP Spearman 0.6041
73 Max Diastolic Pressure Mean MAP Spearman 0.5847
75 Max MAP Mean Diastolic Pressure Spearman 0.5596
77 Max Diastolic Pressure Max Systolic Pressure Spearman 0.5512
79 Mean Systolic Pressure Min MAP Spearman 0.5493
81 Min Diastolic Pressure Min Systolic Pressure Spearman 0.5186
83 Mean MAP Min Systolic Pressure Spearman 0.5106
85 Mean Diastolic Pressure Min MAP Spearman 0.5080
87 Min Systolic Pressure Mean MAP Spearman 0.5078
89 Max Lactate Min pH Spearman -0.5302
P_Value Group
1 0.000000e+00 Survivors
3 0.000000e+00 Non-Survivors
5 0.000000e+00 Non-Survivors
7 0.000000e+00 Survivors
9 0.000000e+00 Non-Survivors
11 0.000000e+00 Non-Survivors
13 0.000000e+00 Non-Survivors
15 0.000000e+00 Survivors
17 0.000000e+00 Non-Survivors
19 0.000000e+00 Non-Survivors
21 0.000000e+00 Non-Survivors
23 0.000000e+00 Non-Survivors
25 0.000000e+00 Non-Survivors
27 0.000000e+00 Non-Survivors
29 0.000000e+00 Survivors
31 0.000000e+00 Non-Survivors
33 0.000000e+00 Survivors
35 0.000000e+00 Non-Survivors
37 0.000000e+00 Survivors
39 0.000000e+00 Survivors
41 0.000000e+00 Survivors
43 0.000000e+00 Survivors
45 0.000000e+00 Survivors
47 0.000000e+00 Survivors
49 0.000000e+00 Survivors
51 8.287704e-297 Non-Survivors
53 9.774152e-286 Non-Survivors
55 0.000000e+00 Survivors
57 0.000000e+00 Survivors
59 0.000000e+00 Survivors
61 2.256757e-242 Non-Survivors
63 0.000000e+00 Survivors
65 1.279377e-220 Non-Survivors
67 0.000000e+00 Survivors
69 0.000000e+00 Survivors
71 7.549115e-215 Non-Survivors
73 4.401237e-198 Non-Survivors
75 4.030514e-178 Non-Survivors
77 8.346475e-172 Non-Survivors
79 2.152481e-170 Non-Survivors
81 0.000000e+00 Survivors
83 0.000000e+00 Survivors
85 6.528197e-142 Non-Survivors
87 8.496490e-142 Non-Survivors
89 9.810846e-157 Non-Survivors
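The nested loop above can visit the same two columns in both orders, (A, B) and (B, A), which is why the results are later thinned by row index. A minimal sketch of one alternative, assuming only that the column names are available as a list: `itertools.combinations` yields each unordered pair exactly once. The helper name `unique_pairs` and the short column list are illustrative and not part of the original notebook.

```python
from itertools import combinations

def unique_pairs(columns, exclude=("Hospital Mortality",)):
    """Return each unordered pair of column names exactly once, skipping excluded names."""
    kept = [c for c in columns if c not in exclude]
    return list(combinations(kept, 2))

# Illustrative subset of column names (hypothetical, not the full dataset)
cols = ["Hospital Mortality", "Mean MAP", "Mean BUN", "Mean Creatinine"]
pairs = unique_pairs(cols)
for col1, col2 in pairs:
    print(col1, "vs", col2)
```

Looping over `pairs` in place of the two nested `for` statements would make the `index % 2` step unnecessary, at the cost of changing which of the two orderings appears in `Variable_1` and `Variable_2`.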
cont_spear_df = cont_all_df[["Hospital Mortality","Mean Diastolic Pressure", "Mean Systolic Pressure", "Mean BUN", "Mean Creatinine", "Mean MAP","Max Lactate","Min pH"]]
cont_spear_df.head(5)
| | Hospital Mortality | Mean Diastolic Pressure | Mean Systolic Pressure | Mean BUN | Mean Creatinine | Mean MAP | Max Lactate | Min pH |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 55.720000 | 102.960000 | 44.75 | 2.65 | 75.692812 | 8.8 | 7.26 |
| 1 | 1 | 79.525000 | 159.375000 | 16.50 | 1.30 | 98.850000 | 2.7 | 7.39 |
| 2 | 1 | 64.885714 | 128.228571 | 33.50 | 1.50 | 88.200000 | 15.1 | 7.02 |
| 9 | 1 | 59.923077 | 114.153846 | 12.00 | 0.80 | 79.923077 | 1.5 | 7.46 |
| 13 | 0 | 62.350000 | 94.300000 | 48.50 | 2.20 | 73.000005 | 2.8 | 7.32 |
import seaborn as sns
sns.pairplot(data=cont_spear_df, hue="Hospital Mortality")
plt.show()
Density plot of continuous variables¶
Mann-Whitney U test¶
for column in ordinal_df.columns:
    if column != 'Hospital Mortality':
        # Compare survivors (0) against non-survivors (1) for each ordinal score
        stat, p = stats.mannwhitneyu(ordinal_df[column][ordinal_df['Hospital Mortality'] == 0],
                                     ordinal_df[column][ordinal_df['Hospital Mortality'] == 1])
        print(f"Mann-Whitney U test for {column}:")
        print(f" Statistic: {stat}")
        print(f" p-value: {p}")
        print("---------------------------------------------------------------------")
        # Append results to the dictionary
        result_dict['Variable'].append(column)
        result_dict['Data_Type'].append('Ordinal')
        result_dict['Type_of_Test'].append('Mann-Whitney U')
        result_dict['P-value'].append(p)
        # result_dict_n['Variable'].append(column)
        # result_dict_n['Data_Type'].append('Ordinal')
        # result_dict_n['Type_of_Test'].append('Mann-Whitney U')
        # result_dict_n['P-value'].append(p)
Mann-Whitney U test for SAPS II:
 Statistic: 5016710.5
 p-value: 0.0
---------------------------------------------------------------------
Mann-Whitney U test for SOFA:
 Statistic: 6833386.5
 p-value: 3.564756196880308e-178
---------------------------------------------------------------------
Mann-Whitney U test for OASIS:
 Statistic: 5437509.0
 p-value: 6.727799650542416e-308
---------------------------------------------------------------------
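With samples this large, the p-values above are all effectively zero, so an effect size may be more useful for comparing the three scores. A minimal sketch, under two assumptions: the reported statistic is U for the first sample (survivors), as recent SciPy versions document, and the group sizes are 10,331 survivors and 2,158 non-survivors, taken from the contingency tables in the Chi-Square section. The helper name `rank_biserial` is hypothetical.

```python
def rank_biserial(u_stat, n1, n2):
    """Rank-biserial correlation from the U statistic of the first sample.

    Assumes u_stat counts the pairs in which the first sample exceeds the
    second (ties counted as one half). The result lies in [-1, 1]; negative
    values mean the first group tends to have lower values.
    """
    return 2.0 * u_stat / (n1 * n2) - 1.0

# U statistics copied from the output above; group sizes are an assumption
n_survivors, n_non_survivors = 10331, 2158
u_values = {"SAPS II": 5016710.5, "SOFA": 6833386.5, "OASIS": 5437509.0}
effects = {name: rank_biserial(u, n_survivors, n_non_survivors)
           for name, u in u_values.items()}
for name, r in effects.items():
    print(f"{name}: rank-biserial r = {r:.3f}")
```

Under these assumptions all three values are negative, meaning survivors tend to have lower severity scores than non-survivors.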
Spearman Correlation¶
Spearman correlation was also computed between the ordinal variables, separately for survivors and non-survivors, as a supplementary analysis.
# Create an empty dictionary to store the correlation results
corr_results = {'Variable_1': [], 'Variable_2': [], 'Test': [], 'Correlation': [], 'P_value': [], 'Group': []}
# Iterate over the ordinal variables, excluding 'Hospital Mortality'
for col1 in ordinal_df.columns[1:-1]:
    for col2 in ordinal_df.columns[2:]:
        if col1 != col2:
            # Separate data into two groups based on target
            survivors = ordinal_df[ordinal_df['Hospital Mortality'] == 0]
            non_survivors = ordinal_df[ordinal_df['Hospital Mortality'] == 1]
            # Calculate Spearman's correlation coefficient and p-value for each group
            corr_survivors, p_value_survivors = stats.spearmanr(survivors[col1], survivors[col2])
            corr_non_survivors, p_value_non_survivors = stats.spearmanr(non_survivors[col1], non_survivors[col2])
            # Store the results
            corr_results['Variable_1'].extend([col1, col1])
            corr_results['Variable_2'].extend([col2, col2])
            corr_results['Correlation'].extend([corr_survivors, corr_non_survivors])
            corr_results['P_value'].extend([p_value_survivors, p_value_non_survivors])
            corr_results['Group'].extend(['Survivors', 'Non-Survivors'])
            corr_results['Test'].extend(['Spearman', 'Spearman'])
# Create a DataFrame from the results
corr_df = pd.DataFrame(corr_results)
# Print the DataFrame
print(corr_df)
  Variable_1 Variable_2      Test  Correlation        P_value          Group
0    SAPS II       SOFA  Spearman     0.596778   0.000000e+00      Survivors
1    SAPS II       SOFA  Spearman     0.675700  7.755076e-288  Non-Survivors
2    SAPS II      OASIS  Spearman     0.598171   0.000000e+00      Survivors
3    SAPS II      OASIS  Spearman     0.674522  1.814579e-286  Non-Survivors
4       SOFA      OASIS  Spearman     0.360364   0.000000e+00      Survivors
5       SOFA      OASIS  Spearman     0.453597  5.383949e-110  Non-Survivors
Density plot of ordinal variables¶
for column in ordinal_df.columns:
    if column != 'Hospital Mortality':
        # Create a figure and axes
        fig, ax = plt.subplots(figsize=(10, 6))
        # Density plot for each group and for the whole dataset
        sns.kdeplot(data=df[df['Hospital Mortality'] == 0][column], ax=ax, label='Alive', fill=True, color='g')
        sns.kdeplot(data=df[df['Hospital Mortality'] == 1][column], ax=ax, label='Dead', fill=True, color='r')
        sns.kdeplot(data=df[column], label='Overall Classes', ax=ax, fill=True, color='b')
        plt.title(f'Density Plot for {column}')
        plt.xlabel(column)
        plt.ylabel('Density')
        plt.legend()
        plt.show()
Scatter Plot¶
# Create a scatter plot matrix for ordinal_df
sns.pairplot(ordinal_df, hue='Hospital Mortality', palette='Set1')
# Set the title of the plot
plt.suptitle('Scatter Plot Matrix for Ordinal Variables')
# Display the plot
plt.show()
Chi-Square test¶
for column in binary_df.columns[1:]:
    # Cross-tabulate the target against each binary variable
    contingency_table = pd.crosstab(df['Hospital Mortality'], df[column])
    chi2, p, _, _ = chi2_contingency(contingency_table)
    print()
    print(contingency_table)
    print()
    print(f"Chi-square test between Hospital Mortality and {column}:")
    print(f"Chi2 value: {chi2}")
    print(f"P-value: {p}")
    print("--------------------------------------------------------")
    # Append results to the dictionary
    result_dict['Variable'].append(column)
    result_dict['Data_Type'].append('Categorical')
    result_dict['Type_of_Test'].append('Chi-Square')
    result_dict['P-value'].append(p)
    # result_dict_n['Variable'].append(column)
    # result_dict_n['Data_Type'].append('Categorical')
    # result_dict_n['Type_of_Test'].append('Chi-Square')
    # result_dict_n['P-value'].append(p)
Gender                                      F      M
Hospital Mortality
0                                        3938   6393
1                                         953   1205

Chi-square test between Hospital Mortality and Gender:
Chi2 value: 27.10758978619394
P-value: 1.9244091683381194e-07
--------------------------------------------------------

Uncomplicated Hypertension                  0      1
Hospital Mortality
0                                        5635   4696
1                                        1366    792

Chi-square test between Hospital Mortality and Uncomplicated Hypertension:
Chi2 value: 55.189196979179414
P-value: 1.0946874261933157e-13
--------------------------------------------------------

Complicated Hypertension                    0      1
Hospital Mortality
0                                        9486    845
1                                        1941    217

Chi-square test between Hospital Mortality and Complicated Hypertension:
Chi2 value: 7.838345176804054
P-value: 0.005114940941223975
--------------------------------------------------------

Uncomplicated Diabetes                      0      1
Hospital Mortality
0                                        8273   2058
1                                        1740    418

Chi-square test between Hospital Mortality and Uncomplicated Diabetes:
Chi2 value: 0.30699355812333073
P-value: 0.5795309446027188
--------------------------------------------------------

Complicated Diabetes                        0      1
Hospital Mortality
0                                        9771    560
1                                        2050    108

Chi-square test between Hospital Mortality and Complicated Diabetes:
Chi2 value: 0.5306520803096673
P-value: 0.4663328403623749
--------------------------------------------------------

Malignancy                                  0      1
Hospital Mortality
0                                        9265   1066
1                                        1756    402

Chi-square test between Hospital Mortality and Malignancy:
Chi2 value: 118.04115471197295
P-value: 1.698274709058758e-27
--------------------------------------------------------

Hematologic Disease                         0      1
Hospital Mortality
0                                        8820   1511
1                                        1610    548

Chi-square test between Hospital Mortality and Hematologic Disease:
Chi2 value: 149.55075555628915
P-value: 2.1734758882909007e-34
--------------------------------------------------------

Metastasis                                  0      1
Hospital Mortality
0                                        9921    410
1                                        1955    203

Chi-square test between Hospital Mortality and Metastasis:
Chi2 value: 111.94872842345087
P-value: 3.666710397229884e-26
--------------------------------------------------------

Peripheral Vascular Disease                 0      1
Hospital Mortality
0                                        9465    866
1                                        1983    175

Chi-square test between Hospital Mortality and Peripheral Vascular Disease:
Chi2 value: 0.14043292179104155
P-value: 0.7078510082500947
--------------------------------------------------------

Hypothyroidism                              0      1
Hospital Mortality
0                                        9442    889
1                                        1986    172

Chi-square test between Hospital Mortality and Hypothyroidism:
Chi2 value: 0.8455723053468485
P-value: 0.3578079363406852
--------------------------------------------------------

Chronic Heart Failure                       0      1
Hospital Mortality
0                                        7965   2366
1                                        1539    619

Chi-square test between Hospital Mortality and Chronic Heart Failure:
Chi2 value: 32.494673323998754
P-value: 1.1951967205829576e-08
--------------------------------------------------------

Stroke                                      0      1
Hospital Mortality
0                                        9810    521
1                                        1985    173

Chi-square test between Hospital Mortality and Stroke:
Chi2 value: 29.512849417030985
P-value: 5.5547217143098614e-08
--------------------------------------------------------

Liver Disease                               0      1
Hospital Mortality
0                                        9430    901
1                                        1721    437

Chi-square test between Hospital Mortality and Liver Disease:
Chi2 value: 246.83972334726226
P-value: 1.2688950646354187e-55
--------------------------------------------------------

Sepsis                                      0      1
Hospital Mortality
0                                        8872   1459
1                                        1356    802

Chi-square test between Hospital Mortality and Sepsis:
Chi2 value: 637.6686458256187
P-value: 1.0739326128684081e-140
--------------------------------------------------------

Any Organ Failure                           0      1
Hospital Mortality
0                                        5039   5292
1                                         423   1735

Chi-square test between Hospital Mortality and Any Organ Failure:
Chi2 value: 616.2527393356861
P-value: 4.884011258785986e-136
--------------------------------------------------------

Severe Respiratory Failure                  0      1
Hospital Mortality
0                                        9699    632
1                                        1794    364

Chi-square test between Hospital Mortality and Severe Respiratory Failure:
Chi2 value: 279.62518427423595
P-value: 9.063066574953642e-63
--------------------------------------------------------

Severe Coagulation Failure                  0      1
Hospital Mortality
0                                       10298     33
1                                        2110     48

Chi-square test between Hospital Mortality and Severe Coagulation Failure:
Chi2 value: 97.58692892932098
P-value: 5.1543055964643073e-23
--------------------------------------------------------

Severe Liver Failure                        0      1
Hospital Mortality
0                                       10260     71
1                                        2065     93

Chi-square test between Hospital Mortality and Severe Liver Failure:
Chi2 value: 177.95721250164928
P-value: 1.353497596467662e-40
--------------------------------------------------------

Severe Cardiovascular Failure               0      1
Hospital Mortality
0                                        9267   1064
1                                        1383    775

Chi-square test between Hospital Mortality and Severe Cardiovascular Failure:
Chi2 value: 930.6517936766536
P-value: 2.1311389141756498e-204
--------------------------------------------------------

Severe Central Nervous System Failure       0      1
Hospital Mortality
0                                        9804    527
1                                        1949    209

Chi-square test between Hospital Mortality and Severe Central Nervous System Failure:
Chi2 value: 66.80535560379377
P-value: 2.9968284695977144e-16
--------------------------------------------------------

Severe Renal Failure                        0      1
Hospital Mortality
0                                        9964    367
1                                        1839    319

Chi-square test between Hospital Mortality and Severe Renal Failure:
Chi2 value: 431.49833268868963
P-value: 7.66968217562407e-96
--------------------------------------------------------

Respiratory Dysfunction                     0      1
Hospital Mortality
0                                        7728   2603
1                                        1083   1075

Chi-square test between Hospital Mortality and Respiratory Dysfunction:
Chi2 value: 519.5454689895382
P-value: 5.314118235626247e-115
--------------------------------------------------------

Cardiovascular Dysfunction                  0      1
Hospital Mortality
0                                        9104   1227
1                                        1381    777

Chi-square test between Hospital Mortality and Cardiovascular Dysfunction:
Chi2 value: 769.6863056545292
P-value: 2.103599616429413e-169
--------------------------------------------------------

Renal Dysfunction                           0      1
Hospital Mortality
0                                        7861   2470
1                                        1102   1056

Chi-square test between Hospital Mortality and Renal Dysfunction:
Chi2 value: 550.5302091270887
P-value: 9.652686387061328e-122
--------------------------------------------------------

Hematologic Dysfunction                     0      1
Hospital Mortality
0                                        9286   1045
1                                        1709    449

Chi-square test between Hospital Mortality and Hematologic Dysfunction:
Chi2 value: 192.72721363359676
P-value: 8.073483031253256e-44
--------------------------------------------------------

Metabolic Dysfunction                       0      1
Hospital Mortality
0                                        9289   1042
1                                        1674    484

Chi-square test between Hospital Mortality and Metabolic Dysfunction:
Chi2 value: 252.36956043917155
P-value: 7.904304859920673e-57
--------------------------------------------------------

Neurologic Dysfunction                      0      1
Hospital Mortality
0                                        9391    940
1                                        1854    304

Chi-square test between Hospital Mortality and Neurologic Dysfunction:
Chi2 value: 48.97268661992234
P-value: 2.595517465197432e-12
--------------------------------------------------------
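The chi-square statistic shows whether an association exists but not how strong it is or in which direction. A minimal sketch of one complementary measure, the odds ratio with a Woolf (log-scale) 95% confidence interval, computed directly from the four cell counts of a 2x2 table. The helper name `odds_ratio_ci` and its argument layout are assumptions made for illustration; the counts are copied from the Sepsis table above.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf confidence interval for a 2x2 table.

    Assumed layout:
        a = died with the condition,     b = died without it,
        c = survived with the condition, d = survived without it.
    All four counts must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Sepsis: Hospital Mortality 0 -> (1459 with, 8872 without); 1 -> (802 with, 1356 without)
sepsis_or, sepsis_lo, sepsis_hi = odds_ratio_ci(a=802, b=1356, c=1459, d=8872)
print(f"Sepsis odds ratio: {sepsis_or:.2f} (95% CI {sepsis_lo:.2f} to {sepsis_hi:.2f})")
```

An odds ratio above 1 with an interval that excludes 1 points to higher odds of death when the condition is present. This is an unadjusted measure and says nothing about confounding by other variables.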
Box Plot of continuous variables¶
for column in cont_all_df.columns:
    if column != 'Hospital Mortality':
        # One box per outcome group for each continuous variable
        sns.boxplot(x='Hospital Mortality', y=column, data=cont_all_df, hue='Hospital Mortality', legend=False)
        plt.title(f'Box Plot for {column} by Hospital Mortality')
        plt.show()
Summary Table for Statistical tests conducted¶
vital_signs = ['Max Heart Rate', 'Min Heart Rate', 'Mean Heart Rate', 'Max MAP', 'Min MAP', 'Mean MAP', 'Max Systolic Pressure', 'Min Systolic Pressure', 'Mean Systolic Pressure', 'Max Diastolic Pressure', 'Min Diastolic Pressure', 'Mean Diastolic Pressure', 'Max Temperature', 'Min Temperature', 'Mean Temperature']
demographic = ['Age', 'Gender']
diagnosis = ['Sepsis', 'Any Organ Failure', 'Severe Respiratory Failure', 'Severe Coagulation Failure', 'Severe Liver Failure', 'Severe Cardiovascular Failure', 'Severe Central Nervous System Failure', 'Severe Renal Failure', 'Respiratory Dysfunction', 'Cardiovascular Dysfunction', 'Renal Dysfunction', 'Hematologic Dysfunction', 'Metabolic Dysfunction', 'Neurologic Dysfunction']
severity = ['SAPS II', 'SOFA', 'OASIS']
lab = ['Max Lactate', 'Min Lactate', 'Mean Lactate', 'Max pH', 'Min pH', 'Mean pH', 'Max Glucose', 'Min Glucose', 'Mean Glucose', 'Max WBC', 'Min WBC', 'Mean WBC', 'Max BUN', 'Min BUN', 'Mean BUN', 'Max Creatinine', 'Min Creatinine', 'Mean Creatinine', 'Max Hemoglobin', 'Min Hemoglobin', 'Mean Hemoglobin']
history = ['Uncomplicated Hypertension', 'Complicated Hypertension', 'Uncomplicated Diabetes', 'Complicated Diabetes', 'Malignancy', 'Hematologic Disease', 'Metastasis', 'Peripheral Vascular Disease', 'Hypothyroidism', 'Chronic Heart Failure', 'Stroke', 'Liver Disease']
summary_of_tests = pd.DataFrame(result_dict)

def categorize_variable(variable):
    """Map a variable name to its clinical category using the lists above."""
    if variable in vital_signs:
        return 'Vital signs'
    elif variable in demographic:
        return 'Demographic'
    elif variable in diagnosis:
        return 'Diagnosis'
    elif variable in severity:
        return 'Severity'
    elif variable in lab:
        return 'Laboratory results'
    elif variable in history:
        return 'Medical history'
    else:
        return 'Other'

# Apply the function
summary_of_tests['Category'] = summary_of_tests['Variable'].apply(categorize_variable)
# Move 'Category' to be the second column
summary_of_tests.insert(1, 'Category', summary_of_tests.pop('Category'))
summary_of_tests.sort_values(by='Category', inplace=True)
# summary_of_tests['P-value'] = round(summary_of_tests['P-value'], 5)
print(summary_of_tests)
# print(type(summary_of_tests['P-value']))
Variable Category Data_Type \
40 Gender Demographic Categorical
8 Age Demographic Continuous
66 Neurologic Dysfunction Diagnosis Categorical
65 Metabolic Dysfunction Diagnosis Categorical
53 Sepsis Diagnosis Categorical
55 Severe Respiratory Failure Diagnosis Categorical
56 Severe Coagulation Failure Diagnosis Categorical
57 Severe Liver Failure Diagnosis Categorical
54 Any Organ Failure Diagnosis Categorical
59 Severe Central Nervous System Failure Diagnosis Categorical
60 Severe Renal Failure Diagnosis Categorical
61 Respiratory Dysfunction Diagnosis Categorical
62 Cardiovascular Dysfunction Diagnosis Categorical
63 Renal Dysfunction Diagnosis Categorical
58 Severe Cardiovascular Failure Diagnosis Categorical
64 Hematologic Dysfunction Diagnosis Categorical
28 Max WBC Laboratory results Continuous
29 Min WBC Laboratory results Continuous
30 Mean WBC Laboratory results Continuous
31 Max BUN Laboratory results Continuous
34 Min Creatinine Laboratory results Continuous
35 Mean Creatinine Laboratory results Continuous
36 Min Hemoglobin Laboratory results Continuous
27 Mean Glucose Laboratory results Continuous
32 Min BUN Laboratory results Continuous
26 Min Glucose Laboratory results Continuous
33 Max Creatinine Laboratory results Continuous
24 Mean pH Laboratory results Continuous
25 Max Glucose Laboratory results Continuous
6 Max Hemoglobin Laboratory results Continuous
5 Mean BUN Laboratory results Continuous
19 Max Lactate Laboratory results Continuous
7 Mean Hemoglobin Laboratory results Continuous
21 Mean Lactate Laboratory results Continuous
22 Max pH Laboratory results Continuous
23 Min pH Laboratory results Continuous
20 Min Lactate Laboratory results Continuous
52 Liver Disease Medical history Categorical
51 Stroke Medical history Categorical
50 Chronic Heart Failure Medical history Categorical
49 Hypothyroidism Medical history Categorical
48 Peripheral Vascular Disease Medical history Categorical
47 Metastasis Medical history Categorical
45 Malignancy Medical history Categorical
46 Hematologic Disease Medical history Categorical
43 Uncomplicated Diabetes Medical history Categorical
42 Complicated Hypertension Medical history Categorical
41 Uncomplicated Hypertension Medical history Categorical
44 Complicated Diabetes Medical history Categorical
39 OASIS Severity Ordinal
38 SOFA Severity Ordinal
37 SAPS II Severity Ordinal
1 Mean Heart Rate Vital signs Continuous
2 Mean MAP Vital signs Continuous
3 Mean Systolic Pressure Vital signs Continuous
4 Mean Diastolic Pressure Vital signs Continuous
18 Mean Temperature Vital signs Continuous
17 Min Temperature Vital signs Continuous
11 Min MAP Vital signs Continuous
9 Min Heart Rate Vital signs Continuous
10 Max MAP Vital signs Continuous
12 Max Systolic Pressure Vital signs Continuous
13 Min Systolic Pressure Vital signs Continuous
15 Min Diastolic Pressure Vital signs Continuous
14 Max Diastolic Pressure Vital signs Continuous
16 Max Temperature Vital signs Continuous
0 Max Heart Rate Vital signs Continuous
Type_of_Test P-value
40 Chi-Square 1.924409e-07
8 Mann-Whitney U 2.152532e-66
66 Chi-Square 2.595517e-12
65 Chi-Square 7.904305e-57
53 Chi-Square 1.073933e-140
55 Chi-Square 9.063067e-63
56 Chi-Square 5.154306e-23
57 Chi-Square 1.353498e-40
54 Chi-Square 4.884011e-136
59 Chi-Square 2.996828e-16
60 Chi-Square 7.669682e-96
61 Chi-Square 5.314118e-115
62 Chi-Square 2.103600e-169
63 Chi-Square 9.652686e-122
58 Chi-Square 2.131139e-204
64 Chi-Square 8.073483e-44
28 Mann-Whitney U 1.111790e-13
29 Mann-Whitney U 2.281026e-15
30 Mann-Whitney U 3.781638e-15
31 Mann-Whitney U 2.023735e-170
34 Mann-Whitney U 3.991095e-94
35 Mann-Whitney U 4.675795e-110
36 Mann-Whitney U 4.403853e-04
27 Mann-Whitney U 2.594076e-57
32 Mann-Whitney U 1.765163e-161
26 Mann-Whitney U 2.284125e-38
33 Mann-Whitney U 6.435700e-118
24 Mann-Whitney U 2.205225e-47
25 Mann-Whitney U 3.594211e-31
6 Welch's T-test 8.444631e-21
5 Welch's T-test 4.778132e-145
19 Mann-Whitney U 6.177761e-84
7 Welch's T-test 4.100471e-01
21 Mann-Whitney U 1.694887e-114
22 Mann-Whitney U 4.311983e-29
23 Mann-Whitney U 2.144696e-47
20 Mann-Whitney U 1.404491e-153
52 Chi-Square 1.268895e-55
51 Chi-Square 5.554722e-08
50 Chi-Square 1.195197e-08
49 Chi-Square 3.578079e-01
48 Chi-Square 7.078510e-01
47 Chi-Square 3.666710e-26
45 Chi-Square 1.698275e-27
46 Chi-Square 2.173476e-34
43 Chi-Square 5.795309e-01
42 Chi-Square 5.114941e-03
41 Chi-Square 1.094687e-13
44 Chi-Square 4.663328e-01
39 Mann-Whitney U 6.727800e-308
38 Mann-Whitney U 3.564756e-178
37 Mann-Whitney U 0.000000e+00
1 Welch's T-test 2.113293e-22
2 Welch's T-test 5.409361e-23
3 Welch's T-test 4.935026e-29
4 Welch's T-test 4.528457e-18
18 Mann-Whitney U 1.312550e-15
17 Mann-Whitney U 4.344243e-17
11 Mann-Whitney U 5.902742e-94
9 Mann-Whitney U 6.405818e-01
10 Mann-Whitney U 7.401548e-01
12 Mann-Whitney U 9.964243e-02
13 Mann-Whitney U 1.350053e-91
15 Mann-Whitney U 3.890390e-87
14 Mann-Whitney U 7.300896e-02
16 Mann-Whitney U 1.139081e-05
0 Welch's T-test 1.513703e-45
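The table above lists 67 unadjusted p-values. If an adjustment for multiple comparisons were wanted, one common option is the Benjamini-Hochberg procedure, which controls the false discovery rate. The sketch below is a plain-Python illustration and not something the original analysis applied; the function name `benjamini_hochberg` is hypothetical, and the five p-values are a small subset copied from the table for demonstration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Subset of p-values from the summary table (illustration only)
names = ["Complicated Hypertension", "Hypothyroidism", "Peripheral Vascular Disease",
         "Uncomplicated Diabetes", "Complicated Diabetes"]
raw_p = [5.114941e-03, 3.578079e-01, 7.078510e-01, 5.795309e-01, 4.663328e-01]
adj_p = benjamini_hochberg(raw_p)
for name, p_raw, p_adj in zip(names, raw_p, adj_p):
    print(f"{name}: raw = {p_raw:.4g}, adjusted = {p_adj:.4g}")
```

In practice the adjustment would be run on all 67 p-values at once, for example on `summary_of_tests['P-value'].tolist()`. The `multipletests` function in `statsmodels.stats.multitest` offers the same procedure with `method='fdr_bh'`.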
summary_of_tests['Category'].value_counts()
Laboratory results    21
Vital signs           15
Diagnosis             14
Medical history       12
Severity               3
Demographic            2
Name: Category, dtype: int64
# Export the DataFrame to an Excel file
summary_of_tests.to_excel('summary_of_tests.xlsx', index=False)